HN Mirror

	by astrange 444 days ago
	That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
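
To make the distinction concrete, here is a toy sketch, not anything from the post being discussed. It assumes a "model" that is just three logits over three candidate outputs, and a made-up `verifier` that accepts only output 2. The names `sft_step`, `rl_step`, and `verifier` are hypothetical. The SFT step is handed the desired output directly; the RL step (plain REINFORCE here, with a verification-style reward rather than preferences or ratings) only gets a scalar score for whatever the model sampled itself.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, target, lr=0.5):
    """Supervised step: the dataset supplies the exact desired output."""
    probs = softmax(logits)
    # Gradient of log p(target) w.r.t. the logits is onehot(target) - probs.
    return [
        z + lr * ((1.0 if i == target else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


def rl_step(logits, reward_fn, rng, lr=0.5):
    """REINFORCE step: the model samples an output and only sees a scalar reward."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    # Same log-prob gradient as above, but for the sampled output, scaled by reward.
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


def verifier(action):
    """Hypothetical checker: output 2 passes, everything else fails."""
    return 1.0 if action == 2 else 0.0


def train_sft(steps=200):
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = sft_step(logits, target=2)  # the answer is given to the model
    return softmax(logits)


def train_rl(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = rl_step(logits, verifier, rng)  # the answer is never shown
    return softmax(logits)
```

Both loops end up favoring output 2, but `train_sft` needs the correct output for every example, while `train_rl` only needs something that can score an attempt.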