by Der_Einzige 860 days ago
The latter is strictly superior to the former though. RLHF has been abandoned in the open source world.
3 comments

I don't know about strictly superior. It's certainly strictly easier for people with a budget, who just need "good enough" results on the first try. I don't have any evidence whatsoever, but I'd expect that enough tuning and retries can squeeze a bit more performance out of RLHF than you can get out of DPO.
Yep, DPO is not technically “RL” and implicitly uses the LLM itself as a reward model, but training with DPO is far more stable for that reason.
DPO is as close to RL as RLHF. The latter also uses the LLM as a reward model.
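
For anyone following along, here is a minimal sketch of what "uses the LLM itself as a reward model" means in DPO, assuming the loss from the DPO paper: the implicit reward of a response is beta times the log-probability ratio between the policy and a frozen reference model. The function names and the scalar log-probabilities are hypothetical stand-ins for illustration; a real trainer would sum per-token log-probs over batches of tensors.

```python
import math

def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    # Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    # No separate reward model is trained; the policy's own log-probs supply it.
    return beta * (policy_logp - ref_logp)

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    # Margin between the implicit rewards of the chosen and rejected responses.
    margin = (implicit_reward(policy_chosen_logp, ref_chosen_logp, beta)
              - implicit_reward(policy_rejected_logp, ref_rejected_logp, beta))
    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probs: the policy favors the chosen response
# more than the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss is log(2); it shrinks as the policy shifts probability toward the chosen response relative to the reference.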

I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Still, what the code does isn't what is described in the paper that the page links to.

> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target.

Isn't this just because reinforcement learning and supervised learning are both optimization problems?

In part, yes! But also because what used to define the distinction was the human-curated datasets: SL contained input/output pairs, while RL contained episodes with sporadic rewards.

Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards.
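
To make the dataset shapes concrete, here is a rough sketch of the three record types. All class and field names are hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass

@dataclass
class SupervisedExample:
    # Traditional SL: one input, one target output.
    prompt: str
    target: str

@dataclass
class PreferenceExample:
    # DPO-style: one input, a preferred and a dispreferred output.
    prompt: str
    chosen: str
    rejected: str

@dataclass
class EpisodeStep:
    # RL-style: one step of an episode; the reward is often 0.0 until the end.
    observation: str
    action: str
    reward: float

def to_supervised(example: PreferenceExample) -> SupervisedExample:
    # Dropping the negative example reduces a preference record to a plain SL pair.
    return SupervisedExample(prompt=example.prompt, target=example.chosen)
```

Seen this way, a preference record is an SL pair plus one extra negative output, which is part of why the line between the two gets foggy.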

I tend to agree, @espadrine; it's semantics for the most part.
I am just saying the intro paragraphs are confusing.
Thanks, appreciate the feedback, will update when I get a chance!