HN Mirror


by highfrequency 498 days ago

Cool result, but worth highlighting two points:

- Model is finetuned from Qwen-2.5 Instruct, which includes millions of specially filtered math examples in both pretraining and supervised fine-tuning already.

- To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data. It’s not very clear to me if this is more or less impressive than getting the same result by simply fine-tuning on the 10 million initial pool, but I suppose that would make for a worse headline.

9 comments

armcat 498 days ago

Yes, the authors explicitly highlighted those two points in the abstract, in terms of them being the elicitation threshold for complex reasoning, namely, an extremely complete pre-trained foundation model, and a set of extremely high quality examples post-training.

To your question on finetuning on the initial 10 million pool - intuitively, it would require a tremendous amount of finetuning data to move the needle - you really won't be able to move the gradients much with just 817 examples; that initial pool is effectively enforcing pretty rigid regularization.

There is now increasing interest in showing that small data with inference-time scaling provides significant yield. A couple of recent examples:

* TinyZero: https://github.com/Jiayi-Pan/TinyZero
* s1 Simple Test Time Scaling: https://arxiv.org/abs/2501.19393

highfrequency 497 days ago

The abstract doesn’t specify that the 817 training examples were filtered down by R1 from 10 million initial questions. This helps to understand the result better: it is in large part a testament to R1 and similar models’ remarkable ability to sift through and identify/construct perfect training data for other models.

Eisenstein 497 days ago

Isn't every progression in technology a result of the previous advance in technology enabling it?

highfrequency 497 days ago

Yes, but these three types of progress are worth distinguishing:

1. Mt. Everest is summited for the first time.

2. An easier or more direct route to the Everest summit is discovered.

3. Someone finds that if a more experienced climber is already at the summit and drops down a series of rope ladders and oxygen tanks and cabins at key points, then it is even easier to make the summit because you can now pack lighter.

All three are interesting, worth discussing etc. But it would be a bit of a stretch to conclude from the third one that “less is more” because you don’t need to bring so much gear when someone else brings it for you.

For example, Attention is All You Need had a similar title. But the whole point was that they did not use recurrent networks at any stage in the learning process.

My point is not to discredit this result but to frame it properly: reasoning models like R1/O1 are incredibly efficient at distilling knowledge to smaller non-reasoning models.

amingilani 498 days ago

Why is everyone so critical of using information from a previous model to make a more efficient model? There’s nothing wrong with making progress using prior work. And increasing efficiency is progress.

You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.

carschno 498 days ago

You are looking at it from a product perspective. From a scientific perspective, it just means the respective benchmark is meaningless, so we don't know how well such a model generalizes.

EGreg 498 days ago

Not so! From a scientific perspective the result you can achieve matters, no one is a blank slate.

For humans this is true as well. The way you teach matters. Look at how the bell curve got absolutely demolished for example when math was taught this way:

https://archive.nytimes.com/opinionator.blogs.nytimes.com/20...

h0l0cube 497 days ago

Another way to look at this is: The first assembly language compiler was handcoded in binary to begin with, and then that compiler's machine code was translated to the more expressive language (assembly). Similar for Fortran/C/etc. from assembly code. Progressively, more expressive languages have been bootstrapped from prior lower-level languages. In a similar way, perhaps a more concise LLM can be built by utilizing a less efficient one?

btown 498 days ago

There is a valid criticism that when you rely heavily on synthetic outputs, you bring along the precursor model's biases and assumptions without fully knowing the limitations of the data set the precursor model was trained on, as well as intentional adjustments made by the designers of the precursor model to favor certain geopolitical goals.

But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.

It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.

sebastiennight 498 days ago

Side note about this (great) PG article: its conclusion is that readers are leaving print media to come read online blogs because online content is "more honest" and less formulaic.

After 2 years of widespread GPT slop at the top of search engine results, we've definitely come full circle.

chefandy 498 days ago

Having been an avid net user since the early 90s, I can’t think of a time when that assertion wasn’t specious. In 2005 (the year Gmail debuted and people started using the term “web 2.0”), most of the content on the net was still from traditional media sources, PR garbage and all. Most blogs were still people just rattling off their opinions, which were more likely based on the available content than on their own high-quality research. And lack of oversight is a double-edged sword: sure, you might have been less likely to get pure unfiltered marketing dreck, but you were way more likely to get straight-up bullshit, which is a different but serious problem. I think he was trying to champion the idealistic anti-establishment soul of the early net despite it essentially being an anachronism, even in 2005.

Rumengol 497 days ago

The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning. But that alone is a bit misleading, if you need a massive model to fine tune and another one to piece together the small amount of data.

I've seen the textbook analogy used, but to me it's like a very knowledgeable person reading an advanced textbook to become an expert. Then they say they're better than the other very knowledgeable people because they read that manual, and that everyone can start from scratch using it.

So there's nothing wrong with making a more efficient model from an existing one, the issue is concluding you don't need all the data that made the existing one possible in the first place. While that may be true, this is not how you prove it.

tw1984 497 days ago

> The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning.

they claim that efficient reasoning can be achieved by applying a small set of SFT samples. how that sample set is collected/filtered is irrelevant here. they just reported the fact that this is possible. this by itself is a new and interesting finding.

ciphix 497 days ago

I completely agree with the point made here. Setting aside the research controversy around the paper, from an engineering practice perspective the methodology it presents offers the industry an effective approach to distill structural cognitive capabilities from advanced models and integrate them into less competent ones.

Moreover, I find the Less-Is-More Reasoning (LIMO) hypothesis particularly meaningful. It suggests that encoding the cognitive process doesn't require extensive data; instead, a small amount of data can elicit the model's capabilities. This hypothesis and observation, in my opinion, are highly significant and offer valuable insights, much more than the specific experiment itself.

novakboskov 497 days ago

I'd say that the critique points out that this "information from a previous model" itself needs tremendous amounts of data. Now, did we see any better generalization capabilities with all data counted?

trott 498 days ago

Another way to look at this is that there are 12,290 bits of information in choosing 817 samples from 10,000,000.
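The figure above can be sanity-checked as the base-2 logarithm of the number of ways to pick an unordered subset. This is a minimal sketch, assuming the selection is treated as one choice among all 817-element subsets of the pool; the function name `selection_bits` is made up for illustration. By this calculation the result lands in the same neighborhood as the quoted number, roughly 15 bits per chosen example.

```python
import math

def selection_bits(pool_size: int, chosen: int) -> float:
    """Bits needed to name one unordered subset of `chosen` items from `pool_size`."""
    # math.comb returns the exact binomial coefficient as a big integer,
    # and math.log2 accepts arbitrarily large Python ints.
    return math.log2(math.comb(pool_size, chosen))

bits = selection_bits(10_000_000, 817)
print(f"{bits:,.0f} bits total, {bits / 817:.1f} bits per chosen example")

# A larger starting pool means more bits for the same number of picks.
bigger_pool_bits = selection_bits(10**15, 817)
print(f"{bigger_pool_bits:,.0f} bits when choosing from a pool of 10^15")
```

Note that this only counts the information in the choice itself; it says nothing about how much of that information a model can recover from seeing the chosen samples.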

TOMDM 498 days ago

And much more information when selecting just as many examples from quadrillions of randomly generated examples.

The information from the selection criteria isn't available to the model, just the chosen samples.

EternalFury 497 days ago

Just imagine a textbook that gives you the understanding you need to score high in math competitions…and it describes less than 1,000 problems. This in itself is a major discovery in metacognition.

robotresearcher 497 days ago

It's one more textbook, not one textbook.

I'm not knocking the work. They report large improvements using relatively little data. That's good. But let's be clear that this is further training of a good sized LLM that has read far, far more than any human that ever lived already.

EternalFury 497 days ago

I know. The question is: How much of the Internet trove, including the smart bits, but also the tremendous amount of inane content, is actually useful to building the foundation that allows 1,000 problems to have such an effect?

sdenton4 497 days ago

Well, there's this, which comes close: https://www.wiley.com/en-us/The+Art+and+Craft+of+Problem+Sol...

Most of the math competitions people are working on are high school math competitions - these have problems from a relatively small set of mathematics, so that high school students can reasonably know the appropriate background.

Terretta 497 days ago

> To generate the perfect 817 math examples for LIMO, they used state of the art models like R1 to filter down from an initial pool of 10 million math problems. In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data

The paper, and this comment, seem awfully reminiscent of creating a textbook of curated "maximally informative and distilled" set of cognitive examples to teach students with foundational learning a next level of reasoning.

The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if LLM generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sit near someone who "talks to herself" while doing problems and it's even more evident.

---

* tokgen definition: Listen to conversations in a cafeteria. Many are something other than thoughtful: responses that follow the prompts with near-perfect predictability. To differentiate between these responses and speech that comes after a pause to reflect, one can use the labels thought versus token generation, or tokgen.

ciphix 497 days ago

After reviewing the paper and GitHub training dataset, I have the following observations:

The 800+ training samples, each containing solutions with detailed reasoning steps, were primarily generated by DeepSeek R1 and other advanced models. The reasoning processes within these training solutions are crucial. It's possible that the advanced models have encoded these reasoning processes in the generated samples. A sufficiently large model can then recover such reasoning weights from them, in effect adding a delta from DeepSeek R1, among others.

Therefore, it's not surprising that, with relatively few fine-tuning data, Qwen 2.5 has achieved such significant improvements.

This is merely a conjecture. Further research is needed to analyze and visualize the changes in network weights before and after fine-tuning.
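One simple way to start on the weight analysis suggested above is to measure, per tensor, how far the fine-tuned weights moved relative to the base weights. The following is only a toy sketch: the tensor names and values are invented, the weights are plain flattened lists rather than real checkpoints, and `relative_weight_change` is a hypothetical helper, not anything from the paper or its repository.

```python
import math

def relative_weight_change(before: dict, after: dict) -> dict:
    """Per-tensor relative L2 change: ||after - before|| / ||before||."""
    changes = {}
    for name, old in before.items():
        new = after[name]
        delta_sq = sum((n - o) ** 2 for o, n in zip(old, new))
        base_sq = sum(o ** 2 for o in old)
        # Guard against an all-zero base tensor.
        changes[name] = math.sqrt(delta_sq / base_sq) if base_sq else float("inf")
    return changes

# Toy stand-ins for flattened weight tensors (hypothetical names and values).
base = {"layer0.attn": [1.0, 2.0, 2.0], "layer1.mlp": [3.0, 4.0]}
tuned = {"layer0.attn": [1.0, 2.0, 2.0], "layer1.mlp": [3.0, 4.5]}

changes = relative_weight_change(base, tuned)
# Print tensors from most changed to least changed.
for name, change in sorted(changes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {change:.3f}")
```

With real checkpoints the same ranking would show which layers absorbed most of the fine-tuning update, which is a first step toward testing the conjecture rather than evidence for it.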

GTP 497 days ago

>The last few years of LLM progress have shown we can predict human "reasoning" responses to inputs by modeling likely human responses as if LLM generated. Put another way, most responses are not particularly reasoned, but chain of tokgen*.

Sorry, but I don't get the point of your comment as a whole, and of this part in particular. Yes, most human day-to-day conversations are quite predictable, but some people are still capable of generating original thoughts from time to time. And still, how is it related to the comment you are replying to?

Terretta 496 days ago

> how is it related to the comment you are replying to

Sorry, with quoting, and stating differently:

> a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data

A whole lot of intelligence is used to craft a maximally informative and distilled set of learning into textbooks, to fine-tune reasoning outcomes from our LLM-ish brains.

Or, put the other way around, what works for us can often inform what works for LLMs.

orbital-decay 497 days ago

>In other words, a whole lot of intelligence was used to craft a maximally informative and distilled set of fine-tuning data.

Sounds like any textbook. (and generally the process of knowledge compression over generations that made us who we are)

smallerize 498 days ago

Yeah, but it's cheaper.

The context right now is that OpenAI, with first-mover advantage, cutting-edge hardware, and tens of billions of dollars of investment, are not getting benchmark performance better than Chinese-developed models that are trained with cut-down Nvidia GPUs and a lot less money.

rfoo 498 days ago

But... they are? o3-mini is faster than DeepSeek-R1 and has comparable capability. And while I hate the "AGI achieved internally" meme, o3 is significantly better than o1. Though I wonder how long it will be until DeepSeek-R3 happens. They could skip R2 too, citing Cloudflare R2 :P

pama 498 days ago

A big part of why R1 is much slower than o3-mini is that inference optimization is not yet performed on most solutions for serving R1 models (so R1 is comparable to o1 or o1 pro in terms of latency rather than to o1-mini or o3-mini). The MoE is already relatively efficient if perfectly load balanced in an inference setting, and should have latencies and throughputs that are equal to or better than those of equivalent dense models with 37B parameters. In practice, due to MLA, inference should be even faster for long contexts compared to typical dense models. If DeepSeek or someone else tried to distill the model onto another MoE architecture with even fewer active parameters and properly implemented speculative decoding on top, one could gain additional speedups in inference. I imagine we will see these things, but it takes a bit of time until they are all public.
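The comparison to a 37B dense model can be made concrete with a back-of-envelope compute estimate. This sketch assumes the common rule of thumb of about 2 FLOPs per active parameter per generated token, and it assumes a total of 671B parameters for the MoE, a figure that is not stated in this thread. It ignores memory bandwidth, attention cost, routing overhead, and load imbalance, so it says something about compute only, not about real latency.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params

MOE_ACTIVE_PARAMS = 37e9   # active parameters per token, as quoted in the comment
MOE_TOTAL_PARAMS = 671e9   # assumption: total parameter count, not from this thread

moe_flops = forward_flops_per_token(MOE_ACTIVE_PARAMS)
dense_37b_flops = forward_flops_per_token(37e9)
dense_full_flops = forward_flops_per_token(MOE_TOTAL_PARAMS)

# Under this estimate the MoE costs the same per token as a 37B dense model...
print(f"MoE vs 37B dense: {moe_flops / dense_37b_flops:.1f}x")
# ...and far less than a dense model with the same total parameter count.
ratio = dense_full_flops / moe_flops
print(f"Dense model of equal total size vs MoE: {ratio:.1f}x")
```

The estimate only supports the claim that per-token compute matches a 37B dense model; whether served latency matches depends on the inference optimizations discussed above.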

rfoo 497 days ago

I know that, I'm in this game. I was comparing API throughput/ttft/ttbt of DeepSeek's own R1 API before it went viral in the West, and o3-mini.

I remain unconvinced that DeepSeek themselves didn't optimize their own V3 inference well enough and left another 2x~3x improvement on the table.

pama 497 days ago

I am sure DeepSeek did optimize the inference cost of R1. They did not yet release an efficient MoE downscaling of it, i.e. an R1-mini.

rvnx 498 days ago

I think you could reconsider DeepSeek-R1: it's actually really good.

In comparison, o3-mini gets very vague in its reasoning, and gives surprisingly unhelpful answers (getting too short).

Plus, let's not forget, R1 is available to use and modify under MIT license, which is great.

smallerize 498 days ago

I actually forgot that o3-mini was available now. I was using o1 numbers.

yishanchuan 497 days ago

Sure, but just in mathematical reasoning. If future work also covers mathematical logic reasoning, it will be perfect.

mattigames 497 days ago

You are missing the point: it's about stating the importance of the preselection. Now we know that we may not need huge amounts of data for similar results in other reasoning areas, only highly curated data. Yes, sometimes curated by models themselves, but not necessarily.