Hacker News new | ask | show | jobs
by armcat 490 days ago
Yes, the authors explicitly highlighted those two points in the abstract, in terms of them being the elicitation threshold for complex reasoning, namely, an extremely complete pre-trained foundation model, and a set of extremely high quality examples post-training.

To your question on finetuning on the initial 10 million pool - intuitively, it would require tremendous amount of finetuning data to move the needle - you really won't be able to move the gradients much with just 817 examples, that initial pool is effectively enforcing pretty rigid regularization.

There is now an increasing interest in showing that small data with inference time scaling is providing significant yield. Couple of recent examples:

* TinyZero: https://github.com/Jiayi-Pan/TinyZero * s1 Simple Test Time Scaling: https://arxiv.org/abs/2501.19393

1 comments

The abstract doesn’t specify that the 857 training examples were filtered down by R1 from 10 million initial questions. This helps to understand the result better: it is in large part a testament to R1 and similar models’ remarkable ability sift through and identify/construct perfect training data for other models.
Isn't every progression in technology a result of the previous advance in technology enabling it?
Yes, but these three types of progress are worth distinguishing:

1. Mt. Everest is summited for the first time.

2. An easier or more direct route to the Everest summit is discovered.

3. Someone finds that if a more experienced climber is already at the summit and drops down a series of rope ladders and oxygen tanks and cabins at key points, then it is even easier to make the summit because you can now pack lighter.

All three are interesting, worth discussing etc. But it would be a bit of a stretch to conclude from the third one that “less is more” because you don’t need to bring so much gear when someone else brings it for you.

For example, Attention is All You Need had a similar title. But the whole point was that they did not use recurrent networks at any stage in the learning process.

My point is not to discredit this result but to frame it properly: reasoning models like R1/O1 are incredibly efficient at distilling knowledge to smaller non-reasoning models.