by nextaccountic
859 days ago
> I'm not a fan of the RL/SL dichotomy, because the line gets so foggy. If you squint, every loss is a negative reward, and every policy improvement a supervised target. Isn't this just because reinforcement learning and supervised learning are both optimization problems?

Nowadays datasets come in more forms than input/target pairs, and many are synthetic. DPO trains on preference pairs, with both a positive (preferred) and a negative (rejected) example for each prompt, instead of just a target output as in traditional SL; RLHF uses synthetic rewards, scored by a reward model learned from human preferences.
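
To make the contrast concrete, here is a minimal sketch of the two objectives for a single example. It assumes you already have sequence log-probabilities from the model being trained and from a frozen reference model; the function names, the numbers, and `beta=0.1` are made up for illustration and not taken from any particular library.

```python
import math


def sft_loss(logp_target: float) -> float:
    """Traditional SL: negative log-likelihood of the one target output."""
    return -logp_target


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """DPO loss for one preference pair (positive and negative example)."""
    # How much more likely each response is under the policy than the reference.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: the loss is log(2), i.e. no preference yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the chosen response's log-probability lowers the loss.
improved = dpo_loss(-4.0, -7.0, -5.0, -7.0)
```

Both are plain scalar losses you can minimize with gradient descent, which is roughly the "it's all optimization" point: the difference is in what the dataset provides (one target versus a ranked pair), not in the machinery.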