
This may be useful: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1

But the tl;dr is that we can apply reinforcement learning to a strong base model (i.e. one that hasn't been fine-tuned) to elicit tokens that help the model reach a result that can be verified as correct. That is, if we have a way of verifying that a specific output is correct, the model can be trained to consistently produce tokens that lead to that result for a given input, and this ability generalises better the more problems you train it on.

There are more nuances (the Interconnects article goes into them), but that's the fundamental idea of Reinforcement Learning with Verifiable Rewards (RLVR).
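
To make the idea concrete, here is a toy sketch in Python. It is not the DeepSeek R1 recipe: there is no language model, the "policy" is just a softmax over three hard-coded candidate completions, and the update is an exact policy-gradient step rather than anything sampled. The names `verifiable_reward` and `train` and the candidate strings are made up for illustration. The point is only that the reward comes from a programmatic check of the final answer, and training shifts probability toward whatever passes that check.

```python
import math
import re


def verifiable_reward(completion: str, expected: int) -> float:
    """Return 1.0 if the last integer in the completion equals `expected`, else 0.0."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(candidates, expected, steps=200, lr=0.5):
    """Gradient ascent on expected reward for a softmax policy over fixed candidates.

    Assumption: a real setup samples completions from an LLM and estimates this
    gradient; here the candidate set is tiny, so the gradient is computed exactly.
    """
    logits = [0.0] * len(candidates)
    rewards = [verifiable_reward(c, expected) for c in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        # Expected reward under the current policy, used as a baseline.
        baseline = sum(p * r for p, r in zip(probs, rewards))
        # d E[reward] / d logit_i = p_i * (r_i - baseline)
        logits = [
            logit + lr * p * (r - baseline)
            for logit, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


# Hypothetical completions for the prompt "What is 17 + 25?"
candidates = [
    "17 + 25 = 32",
    "7 + 5 = 12, carry the 1, so the answer is 42",
    "The answer is 41",
]
probs = train(candidates, expected=42)
for text, p in zip(candidates, probs):
    print(f"{p:.3f}  {text}")
```

Only the second candidate passes the check, so its probability climbs towards 1 while the others shrink. Note that the verifier never looks at the intermediate tokens, only at the final answer, which is why it can reward whatever reasoning happens to lead there.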