|
|
|
|
|
by espadrine
863 days ago
|
|
In part, yes! But also because what used to define it was the human-curated datasets: SL contained input/output pairs, while RL contained episodes with sporadic rewards. Nowadays, many datasets have different forms or are synthetic. DPO uses datasets with both positive and negative examples (instead of just a target output as with traditional SL); RLHF uses synthetic rewards. |
|