by CuriouslyC
783 days ago
"Fixing" low-quality data with RLHF is a waste of time. By that point it's already poisoned the model's distribution, and all you're doing is steering it away from catastrophic failure cases. Start with the best data you can, and task-train ("RLHF") for behavior, not preference.