by patelajay285
866 days ago
Fair. DPO is now considered a fairly well-established technique: it is far more stable in training than PPO while still aligning LLMs from human feedback. The package also supports PPO, so you can do traditional RLHF, but I figured more people would be interested in seeing a DPO example, given how unstable PPO is.
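
For context on the stability point: DPO reduces to a classification-style loss over preference pairs, with no separate reward model and no sampling loop during training. Below is a rough sketch of the per-pair loss, not the package's actual code. The function name, the argument names, and the `beta=0.1` default are assumptions made up for illustration; the inputs are assumed to be summed token log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response;
    beta scales how strongly the policy is pushed away from the reference.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree, the margin is zero and the loss is `log(2)`; it drops as the policy shifts probability toward the preferred response and rises if it shifts toward the rejected one.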