by tempusalaria
558 days ago
1) DPO excludes some practical aspects of the RLHF method, e.g. pretraining gradients. 2) The theoretical arguments for DPO's equivalence to RLHF make some assumptions that don't necessarily hold in practice. 3) RLHF gives you a reusable reward model, which has practical uses and advantages; DPO doesn't produce a useful intermediate product. 4) DPO works off pairwise preferences, whereas desirable RL objectives could take many forms. In practice, big labs are testing all these methods to see what works best.
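
To make points 3 and 4 concrete, here is a minimal sketch of the per-pair DPO loss as I understand it from the paper. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are my own illustrative choices, not anything from a particular library. The inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probs.

    beta is a hypothetical default; it scales the implicit reward.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    # No separate reward model is trained or kept around.
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: margin is 0, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy shifts probability toward the chosen response: loss drops below log(2).
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

Note that the only training signal is a chosen/rejected pair, which is the limitation in point 4, and the "reward" exists only implicitly as a log-ratio inside the policy, which is why there is no standalone reward model to reuse as in point 3.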