by programjames
224 days ago
> if you purely use policy optimization, RLHF will be biased towards short horizons

> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly