Hacker News
by josh-sematic 301 days ago
The mechanisms the author describes are used for RLHF, but they are not sufficient for training the recent slew of “reasoning models.” To do that, you have to generate rewards based not on proximity to some reference full-answer transcript, but on how well the final answer (e.g., the part after the “thinking tokens”) meets your reward criteria. This turns out to be a lot harder than the mechanisms used for RLHF, which is one reason we had RLHF for a while before we got the “reasoning models.” It’s also the only way to make sense of the Sutskever quote “You’ll know your RL is working when the thinking tokens are no longer English” (a paraphrase, pulled from my memory).
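A minimal sketch of the outcome-based reward described above, assuming the model ends its reasoning with a `</think>` delimiter and that correctness is an exact string match. The delimiter and the names `final_answer` and `outcome_reward` are illustrative assumptions, not anything a specific lab is known to use:

```python
THINK_CLOSE = "</think>"  # assumed delimiter that ends the thinking tokens


def final_answer(completion: str) -> str:
    """Return the text after the last thinking delimiter.

    If the delimiter is absent, the whole completion is treated as the answer.
    """
    return completion.rpartition(THINK_CLOSE)[2].strip()


def outcome_reward(completion: str, expected: str) -> float:
    """Score only the final answer; the thinking tokens are never inspected."""
    return 1.0 if final_answer(completion) == expected.strip() else 0.0


# Two completions with very different reasoning text get the same reward,
# because only what follows the delimiter is checked.
a = "<think>17 * 3 = 51</think> 51"
b = "<think>zq 17 17 17 -> ok</think> 51"
rewards = [outcome_reward(a, "51"), outcome_reward(b, "51")]
```

Nothing in this reward says which thinking tokens helped; it only reports whether the part after the delimiter was right.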
2 comments

FWIW, that was Karpathy, not Sutskever:

https://x.com/karpathy/status/1835561952258723930?s=19

Sorry, could you explain why it is harder and where the complexity creeps in (compared to some naive "pattern matching the end of the response" tactic)? Thanks!
The pattern matching compares what was said against an example of what a correct response could say.

Checking one token at a time means evaluating whether the partial output is on track to produce a correct final answer. The intermediate text can be whatever it needs to be to arrive at that answer, but training at the per-token level means training the very tokens where you want to give the model leeway. You need another model to adjudicate how well things are going from incomplete answers.

I'm not sure how much the adjudicator evaluates based upon knowing the final answer or based upon the quality of the reasoning of the model being trained. I'd be inclined to train two adjudicators, one that knows the answers and one that doesn't. I'm sure there would be interesting things to see in their differential signal.

Just speculating, but proximity to a reference answer is a much denser reward signal. In contrast, parsing out a final answer into a pass/fail provides only a sparse reward signal.
Yup, RLVR as implemented by DeepSeek et al. uses only outcome supervision instead of process supervision. There have been attempts at process supervision, though.
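
To make the dense-versus-sparse contrast concrete, here is a rough sketch. It assumes character-level similarity from Python's `difflib` as a stand-in for "proximity to a reference" and treats the last line of the completion as the final answer; both choices and the function names are hypothetical, picked only for illustration:

```python
from difflib import SequenceMatcher


def dense_reward(completion: str, reference: str) -> float:
    """Similarity to a reference transcript, a graded value in [0, 1]."""
    return SequenceMatcher(None, completion, reference).ratio()


def sparse_reward(completion: str, expected: str) -> float:
    """Pass/fail on the final answer, taken here to be the last line."""
    lines = completion.strip().splitlines()
    answer = lines[-1].strip() if lines else ""
    return 1.0 if answer == expected else 0.0


reference = "17 * 3 = 51\n51"
samples = [
    "17 * 3 = 51\n51",  # matches the reference exactly
    "17 * 3 = 52\n52",  # near miss
    "no idea\n7",       # far off
]

dense = [dense_reward(s, reference) for s in samples]
sparse = [sparse_reward(s, "51") for s in samples]
```

The dense scores separate the near miss from the far miss, while the sparse scores treat both as plain failures.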