| HN Mirror

	by patelajay285 866 days ago
	Fair. DPO is now considered a fairly well-established technique: it is far more stable to train than PPO, while still aligning LLMs from human feedback. The package also supports PPO, so you can do traditional RLHF, but I figured more people would be interested in seeing a DPO example, given how unstable PPO is.
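
	For anyone unfamiliar with DPO, here is a rough sketch of the per-pair loss in plain Python. It is only meant to illustrate the objective, not how the package implements it. The function name, argument names, and the default `beta=0.1` are all hypothetical; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Illustrative DPO loss for one preference pair (hypothetical helper).

    All four inputs are log-probabilities of a full response; beta scales
    how strongly the policy is pushed away from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin means the policy favors the chosen response
    # more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

	The loss is an ordinary supervised objective over preference pairs, with no reward model or sampling loop in the update, which is a large part of why it tends to be easier to train than PPO.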