by yaj54 445 days ago
This is a super helpful breakdown and really helps me understand how the RL step differs from the initial training step. I didn't realize that in the RL step the reward is delayed until the end of the response. Having the reward depend on a coherent thought rather than a coherent word now seems like an obvious and critical part of how this works.
1 comment

That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
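
To make the distinction concrete, here is a toy sketch (my own illustration, not taken from the linked post). It assumes the simplest possible setup: SFT as per-token negative log-likelihood against a reference output, and RL as a plain REINFORCE-style objective where a single scalar rating for the whole sampled response scales its summed log-probability. The function names and the numbers are hypothetical, and real pipelines typically add things like baselines, clipping, and KL penalties that are left out here.

```python
import math

def sft_loss(reference_token_probs):
    """SFT: average negative log-likelihood the model assigns to each
    token of a known-good reference output (one signal per token)."""
    return -sum(math.log(p) for p in reference_token_probs) / len(reference_token_probs)

def reinforce_loss(sampled_token_probs, reward, baseline=0.0):
    """REINFORCE-style surrogate: one scalar reward for the entire sampled
    response scales the summed log-probability of that response."""
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -(reward - baseline) * log_prob

# Hypothetical per-token probabilities for a three-token response.
probs = [0.5, 0.25, 0.5]

sft = sft_loss(probs)                        # needs the reference tokens themselves
rl_good = reinforce_loss(probs, reward=1.0)  # only needs a rating of the sample
rl_bad = reinforce_loss(probs, reward=-1.0)  # same sample, opposite rating
```

With a positive reward, minimizing the surrogate raises the probability of the sampled response; with a negative reward it lowers it. In neither case does the RL objective need a reference output, only a score for what the model produced.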