by astrange 444 days ago
That post is describing SFT (supervised fine-tuning), not RL. RL trains on preferences, ratings, or verifications of the model's own outputs, not on entire input/output pairs.
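
To make the distinction concrete, here is a toy sketch, not anyone's actual training code. It assumes a "model" that is just three logits over three candidate outputs, and a made-up `verifier` that treats output 2 as correct. The names `sft_step`, `rl_step`, and `verifier` are all hypothetical. The SFT step is handed the reference output directly; the RL step (plain REINFORCE here) only sees a scalar score for whatever it happened to sample.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_step(logits, target, lr=0.5):
    """Supervised step: the training data names the desired output."""
    probs = softmax(logits)
    # Gradient of -log p[target] w.r.t. the logits is (probs - one_hot(target)).
    return [
        l - lr * (p - (1.0 if i == target else 0.0))
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def rl_step(logits, reward_fn, rng, lr=0.5):
    """REINFORCE step: sample an output, then learn only from its score."""
    probs = softmax(logits)
    action = rng.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)  # scalar rating/verification, no reference output
    # Gradient of -reward * log p[action] is reward * (probs - one_hot(action)).
    return [
        l - lr * reward * (p - (1.0 if i == action else 0.0))
        for i, (l, p) in enumerate(zip(logits, probs))
    ]


def verifier(action):
    # Hypothetical checker: assume output 2 is the one that passes.
    return 1.0 if action == 2 else 0.0


def train_sft(steps=200):
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = sft_step(logits, target=2)
    return softmax(logits)


def train_rl(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = rl_step(logits, verifier, rng)
    return softmax(logits)
```

Both loops end up favoring output 2 in this toy setup, but note what each one consumed: `train_sft` needed the answer itself (`target=2`), while `train_rl` only needed something that can grade a sampled answer. Real RL setups add baselines, KL penalties, and so on, which this sketch leaves out.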