Hacker News
by tsaoyu 567 days ago
In short, DPO is not better than PPO. DPO is derived from the so-called Bradley-Terry (BT) reward assumption, which presumes the preference data was collected as pairwise comparisons. Under that assumption, the math lets you learn the preference (the reward) and the action (the policy) at the same time. PPO and other on-policy methods (where the training samples are strictly generated by the LLM itself) don't need that assumption: they work with whatever reward you can compute. For example, in coding and math problems you can get a binary reward directly. A lot of research suggests DPO is OK if you don't care much about OOD performance.
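To make the BT point concrete, here is a rough sketch of the standard per-pair DPO loss as I understand it, in plain Python. The function names, the `beta` default, and the idea of passing summed log-probabilities as plain floats are my own assumptions for illustration, not code from any particular library.

```python
import math


def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, purely illustrative
) -> float:
    """Per-pair DPO loss: negative log-likelihood of the pair under BT.

    The "reward" is implicit: beta * (log pi(y|x) - log pi_ref(y|x)).
    Inputs are sequence log-probabilities under the policy and the
    frozen reference model. No separate reward model is trained.
    """
    implicit_reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    implicit_reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    return -math.log(bt_preference_prob(implicit_reward_chosen, implicit_reward_rejected))
```

When the policy equals the reference, both implicit rewards are zero and the loss is log(2); it drops as the policy puts relatively more probability on the chosen response. Note that the loss only ever sees a (chosen, rejected) pair, which is exactly the BT assumption. A binary pass/fail reward on a single sampled answer has no natural place in it, whereas an on-policy method can use that reward as-is.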