

by Philpax 474 days ago

This may be useful: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1

but the tl;dr is that we can use reinforcement learning on a strong base model (i.e. one that hasn't been fine-tuned) to elicit the generation of tokens that help the model reach a result that can be verified to be correct. That is, if we have a way of verifying that a specific output is correct, the model can be trained to consistently produce tokens that lead to that result for a given input, and this ability generalises better the more problems you train it on.

There are some more nuances (the Interconnects article goes into them), but that's the fundamental idea of Reinforcement Learning with Verifiable Rewards.
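
To make the loop concrete, here's a toy sketch of the idea, not how R1 or any real system is trained. It assumes a stand-in "model" that is just a softmax over a fixed list of candidate answers, a hypothetical `verify` function that checks an addition problem, and plain REINFORCE as the update rule. All names here are made up for illustration.

```python
import math
import random


def verify(problem, answer):
    """Verifiable reward: 1.0 if the answer can be checked as correct, else 0.0."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0


def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(problem, candidates, steps=500, lr=0.5, seed=0):
    """Toy REINFORCE loop: sample an answer, verify it, reinforce it if correct."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # the "policy" starts out uniform
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one candidate index from the current policy.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = verify(problem, candidates[i])
        # Gradient of log pi(i) w.r.t. logit j is (1 if j == i else 0) - probs[j].
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    candidates = [4, 5, 6, 7]
    probs = train((2, 3), candidates)
    for answer, p in zip(candidates, probs):
        print(answer, round(p, 3))
```

The only training signal is the 0/1 output of the verifier; nothing tells the policy what the right answer is. A real setup swaps the softmax for an LLM sampling whole token sequences and trains across many problems, which is where the generalisation comes from.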

1 comment

UltraSane 474 days ago

This paper [1] even claims that "models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions."

[1] https://arxiv.org/abs/2503.01307
