Hacker News
by astrange, 444 days ago
That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
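To make the distinction concrete, here is a toy sketch, not anyone's actual training code. It assumes a "model" that is just three logits over three possible outputs; `verifier` is a hypothetical checker invented for the illustration, and the RL update shown is plain REINFORCE, one of several possible policy-gradient methods. The SFT gradient needs the reference output itself, while the RL gradient only needs a score for whatever output the model happened to sample.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_grad(logits, target):
    # Gradient of -log p(target) w.r.t. the logits.
    # Requires knowing the desired output (the "output" half of an input/output pair).
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def reinforce_grad(logits, sampled, reward):
    # Gradient of -reward * log p(sampled) w.r.t. the logits (REINFORCE).
    # Requires only a scalar score for an output the model produced itself.
    p = softmax(logits)
    return [reward * (p_i - (1.0 if i == sampled else 0.0)) for i, p_i in enumerate(p)]

def verifier(output):
    # Hypothetical checker (an assumption for this sketch): output 2 counts as correct.
    return 1.0 if output == 2 else -1.0

logits = [0.0, 0.0, 0.0]

# SFT: we are handed the right answer (2) and push probability toward it.
sft = sft_grad(logits, target=2)

# RL: the model sampled output 0; we only learn that it scored badly.
rl = reinforce_grad(logits, sampled=0, reward=verifier(0))

print("SFT gradient:", sft)
print("RL gradient: ", rl)
```

With gradient descent, the SFT step raises the logit of the known target, whereas the RL step only lowers the logit of the sampled output that the verifier rejected; the model is never told what the right answer was.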