|
|
|
|
|
by DebtDeflation
474 days ago
|
|
Would love to see a similar explanation of how "reasoning" versions of LLMs are trained. I understand that OpenAI was mum about how they specifically trained o1/o3 and that people are having to reverse engineer from the DeepSeek paper which may or may not be a different approach, but would like to see a coherent explanation which is not just an regurgitation of Chain of Thought or handwavy "special reasoning tokens give the model more time to think". |
|
but the tl;dr of the idea is that we can use reinforcement learning on a strong base model (i.e. one that hasn't been fine tuned) to elicit the generation of tokens that help the model reach a result that can be verified to be correct. That is, if we have a way of verifying that a specific output is correct, the model can be trained to consistently produce tokens that will lead to that result for a given input, and that this facility generalises the more problems you train it on.
There are some more nuances (the Interconnects article goes into that), but that's the fundamental idea of Reinforcement Learning from Verifiable Rewards.