HN Mirror


	by programjames 224 days ago
	> if you purely use policy optimization, RLHF will be biased towards short horizons

	> most RL has some adversarial loss (how do you train your preference network?), which makes the loss landscape fractal which SGD smooths incorrectly