by josh-sematic
301 days ago
The mechanisms the author describes are used for RLHF, but they are not sufficient for training the recent slew of “reasoning models.” To do that, you have to generate rewards based not on proximity to some reference full-answer transcript, but on how well the final answer (e.g., the part after the “thinking tokens”) meets your reward criteria. This turns out to be a lot harder than the mechanisms used for RLHF, which is one reason we had RLHF for a while before we got the “reasoning models.” It’s also the only way you can understand the Karpathy quote “You’ll know your RL is working when the thinking tokens are no longer English” (a paraphrase, pulled from my memory).
https://x.com/karpathy/status/1835561952258723930?s=19
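
To make the "reward only the final answer" idea concrete, here is a rough sketch, not anyone's actual training code. It assumes the transcript marks the end of the thinking tokens with a `</think>` delimiter and that the reward criterion is a simple normalized exact match; both are illustrative assumptions, as are the names `final_answer` and `outcome_reward`.

```python
import re

# Hypothetical delimiter marking the end of the "thinking tokens".
THINK_CLOSE = "</think>"


def final_answer(transcript: str) -> str:
    """Return the text after the last thinking delimiter.

    If the delimiter is absent, the whole transcript counts as the answer.
    """
    _, _, tail = transcript.rpartition(THINK_CLOSE)
    return tail.strip()


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting is ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def outcome_reward(transcript: str, expected: str) -> float:
    """Score only the final answer; the thinking tokens are never inspected.

    Exact match is an assumed stand-in for whatever reward criterion
    a real setup would use (unit tests, a verifier, a grader model, ...).
    """
    return 1.0 if normalize(final_answer(transcript)) == normalize(expected) else 0.0
```

Because nothing before the delimiter is scored, the thinking tokens are free to drift away from readable English as long as the final answer still earns the reward, which is the point of the quote.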