Hacker News
by bigyabai 197 days ago
RLHF is basically a fancy, overengineered GAN. Most of the industry could see that DPO was more efficient for fitting to human behavior.
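
For anyone wondering where the efficiency claim comes from, here is a rough sketch of the per-pair DPO loss as it is usually written. It trains directly on preference pairs against a frozen reference model, with no separate reward model and no RL loop. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names and beta are assumptions).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0))
```

The whole thing is a supervised loss over log-probabilities, which is the sense in which it is cheaper than running PPO against a learned reward model.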