|
|
|
|
|
by patelajay285
854 days ago
|
|
This was discussed in another comment, DPO is pretty much strictly better than RLHF + PPO, and far more stable when training. Yes, DPO is not technically "RL", but it's semantics for the most part. DataDreamer does support PPO training if you want, but it's so unstable, it's a less popular choice now. |
|
Given that, shouldn't the first sentence on the linked page end with "...in a process known as DPO (...)" ? Ditto for the title.
It sounds like you're saying that the terms RL and RLHF should subsume DPO because they both solve the same problem, with similar results. But they're different techniques, and there are established terms for both of them.