by patelajay285 866 days ago
Fair. DPO is now considered a fairly well-established technique: it is far more stable in training than PPO, and it still aligns LLMs from human feedback. The package also supports PPO, so you can do traditional RLHF, but I figured more people would be interested in seeing a DPO example, given how unstable PPO is.
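
For anyone wondering why DPO tends to be simpler to train: the objective is just a classification-style loss on preference pairs, with no reward model or sampling loop. Here is a rough sketch of the per-pair loss in plain Python. This is not the package's API; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are made up for illustration, and the inputs are assumed to be summed log-probabilities of each response under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, illustration only).

    Each argument is the summed log-probability of a full response
    under either the policy being trained or the frozen reference.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of the two margins.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)) computed stably as softplus(-logits).
    z = -logits
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss drops as the policy raises the chosen response's likelihood relative to the rejected one (both measured against the reference), which is the whole training signal. In practice you would compute this over batches with an autodiff framework rather than on scalars.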