The first paragraph says RLHF can be used to align models, and the second says here's how to do it using DPO. These two methods are not the same, and the latter is not an instance of the former.
Fair. DPO is now considered a fairly well-established technique that is far more stable in training than PPO, and it still aligns LLMs from human feedback. The package also supports PPO, so you can do traditional RLHF, but I figured more people would be interested in seeing a DPO example, given how unstable PPO is.
I don't know about strictly superior. It's certainly strictly easier for people on a budget who just need "good enough" results on the first try. I don't have any evidence whatsoever, but I'd expect that enough tuning and retries can squeeze a bit more performance out of RLHF than you can get out of DPO.
DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.
I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.
Still, what the code does isn't what is described in the paper that the page links to.
> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.
Isn't this just because reinforcement learning and supervised learning are both optimization problems?
In part, yes! But also because what used to define the distinction was the shape of the human-curated datasets: SL datasets contained input/output pairs, while RL data consisted of episodes with sporadic rewards.
Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards.
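To make the "positive and negative examples" point concrete, here is a rough sketch of the per-pair DPO objective as I understand it from the DPO paper, in plain Python. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are my own choices for illustration, not anything taken from the package being discussed; the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (illustrative sketch; names are hypothetical)."""
    # Implicit "rewards": how much more likely the policy makes each
    # response compared to the frozen reference model, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), written as a numerically stable softplus.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))

# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Note there is no sampling and no separate reward model here: each training example is a (chosen, rejected) pair and the loss is an ordinary differentiable function of log-probabilities, which is why it looks like supervised learning even though it optimizes a preference objective.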