Cool result, but worth highlighting two points:

- The model is fine-tuned from Qwen-2.5 Instruct, which already includes millions of specially filtered math examples in both pretraining and supervised fine-tuning.
- To generate the perfect 817 math examples for LIMO, they used state-of-the-art models like R1 to filter down from an initial pool of 10 million math problems.

In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It’s not very clear to me if this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.

On your question about fine-tuning on the initial 10 million pool: intuitively, it would take a tremendous amount of fine-tuning data to move the needle. You really won't be able to move the weights much with just 817 examples. That initial pool is effectively enforcing pretty rigid regularization.
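
One rough way to put numbers on this is to count optimizer updates for each dataset size. This is only a back-of-the-envelope sketch: the batch size and epoch count below are my own illustrative assumptions, not settings from the LIMO paper, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Count optimizer updates for a plain epoch-based fine-tuning run."""
    return math.ceil(num_examples / batch_size) * epochs


# Hypothetical hyperparameters, chosen only for illustration.
BATCH_SIZE = 64
EPOCHS = 3

small = optimizer_steps(817, BATCH_SIZE, EPOCHS)         # curated set
large = optimizer_steps(10_000_000, BATCH_SIZE, EPOCHS)  # initial pool

print(f"817 examples:        {small} updates")
print(f"10 million examples: {large} updates")
print(f"ratio:               ~{large // small}x")
```

Under these assumptions the curated set gives only a few dozen updates, versus hundreds of thousands for the full pool, so the fine-tuned weights have far less room to drift from where they started.
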
There is growing interest in showing that small data combined with inference-time scaling yields significant gains. A couple of recent examples:
* TinyZero: https://github.com/Jiayi-Pan/TinyZero
* s1 Simple Test Time Scaling: https://arxiv.org/abs/2501.19393