| HN Mirror



	by yaj54 445 days ago
	This is a super helpful breakdown and really helps me understand how the RL step differs from the initial training step. I didn't realize that in the RL step the reward is delayed until the end of the response. Having the reward depend on a coherent thought rather than a coherent word now seems like an obvious and critical part of how this works.

1 comment

That post is describing SFT, not RL. RL works using preferences/ratings/verifications, not entire input/output pairs.
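
To make the distinction concrete, here is a minimal sketch in plain Python, not taken from the linked post. It assumes a toy 3-token vocabulary, uses a simple REINFORCE-style objective as a stand-in for whatever RL algorithm is actually used, and the `verifier` function is hypothetical. SFT scores every token against a given reference output, while the RL objective weights the model's own sampled response by a single scalar assigned after the response is complete.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one step's logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def sequence_log_prob(step_logits, tokens):
    # Sum of the log-probabilities assigned to `tokens`, one step at a time.
    return sum(log_softmax(l)[t] for l, t in zip(step_logits, tokens))

def sft_loss(step_logits, target_tokens):
    # SFT: a reference output is given, and every token contributes its own
    # cross-entropy term (mean negative log-likelihood).
    return -sequence_log_prob(step_logits, target_tokens) / len(target_tokens)

def reinforce_loss(step_logits, sampled_tokens, reward):
    # REINFORCE-style surrogate: no reference output. One scalar reward,
    # computed after the full response, weights the whole sampled sequence.
    return -reward * sequence_log_prob(step_logits, sampled_tokens)

def verifier(tokens):
    # Hypothetical checker: reward 1.0 only if the final token is token 2.
    return 1.0 if tokens[-1] == 2 else 0.0

# Toy setup: 3-token vocabulary, 2 generation steps, uniform logits.
logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sft = sft_loss(logits, target_tokens=[1, 2])
rl_good = reinforce_loss(logits, [0, 2], verifier([0, 2]))
rl_bad = reinforce_loss(logits, [0, 1], verifier([0, 1]))
```

In this sketch the SFT loss needs the full target `[1, 2]`, whereas the RL loss only needs a sample and its score. A response the verifier rejects contributes nothing to the update here; real systems typically subtract a baseline so that bad samples are actively pushed down.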