Hacker News
by storus 224 days ago
We might not even need RL, as DPO has shown.
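
For context on the DPO point, here is a rough sketch of the per-pair DPO objective in plain Python. It is only an illustration: the function names, the argument names, and the `beta=0.1` default are my own assumptions, and the inputs are assumed to be summed sequence log-probabilities from the policy and a frozen reference model. The point is that the loss is an ordinary supervised classification loss on preference pairs, with no separate reward model and no rollout loop.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    All four inputs are assumed to be log-probabilities of the full
    responses; beta is an illustrative temperature, not a recommended value.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) is the same as softplus(-z).
    return softplus(-logits)
```

If the policy equals the reference, the logits are zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen response relative to the reference.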

> if you purely use policy optimization, RLHF will be biased towards short horizons

> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal, which SGD smooths incorrectly