|
|
|
|
|
by mountainriver
200 days ago
|
|
I mean DPO, PPO, and GRPO all use losses that are not what’s used with SFT for one. They also force exploration as a part of the algorithm. They can be used for synthetic data generation once the reward model is good enough. |
|