by amingilani 494 days ago
Why is everyone so critical of using information from a previous model to make a more efficient model? There’s nothing wrong with making progress using prior work. And increasing efficiency is progress.

You wouldn’t criticize someone’s kombucha because they didn’t piece their SCOBY (symbiotic culture of bacteria and yeast) together microbe by microbe.

4 comments

You are looking at it from a product perspective. From a scientific perspective, it just means the respective benchmark is meaningless, so we don't know how well such a model generalizes.
Not so! From a scientific perspective, the result you can achieve matters; no one is a blank slate.

For humans this is true as well. The way you teach matters. Look at how the bell curve got absolutely demolished for example when math was taught this way:

https://archive.nytimes.com/opinionator.blogs.nytimes.com/20...

Another way to look at this is: The first assembler was hand-coded in binary to begin with, and then that assembler's machine code was translated to the more expressive language (assembly). Similarly for Fortran/C/etc. from assembly code. Progressively, more expressive languages have been bootstrapped from prior lower-level languages. In a similar way, perhaps a more concise LLM can be built by utilizing a less efficient one?
There is a valid criticism that when you rely heavily on synthetic outputs, you bring along the precursor model's biases and assumptions without fully knowing the limitations of the data set the precursor model was trained on, as well as intentional adjustments made by the designers of the precursor model to favor certain geopolitical goals.

But that's not the criticism that I'm often seeing; it's more that there's an "unfair" amount of press coverage towards new models that rely, in the critics' views, more on distillation than on "true" innovation.

It's worth noting that there are many parties with significant motivation to build public sympathy that only "true" innovation should be valued, and it is only their highly-valued investments that can uniquely execute in that space. Cutting-edge models built in caves with a box of their scraps are counter to that narrative. It's worth considering https://paulgraham.com/submarine.html in this context, and understanding whether it is truly "everyone" that is critical in this way.

Side note about this (great) PG article: its conclusion is that readers are leaving print media to come read online blogs because online content is "more honest" and less formulaic.

After 2 years of widespread GPT slop at the top of search engine results, we've definitely come full circle.

Having been an avid net user since the early 90s, I can’t think of a time when that assertion wasn’t specious. In 2005, the year Gmail debuted and people started using the term “web 2.0”, most of the content on the net was still from traditional media sources, PR garbage and all. Most blogs were still people just rattling off their opinions, which were more likely based on the available content than on their own high-quality research. And lack of oversight is a double-edged sword: sure, you might have been less likely to get pure unfiltered marketing dreck, but you were way more likely to get straight-up bullshit, which is a different but serious problem. I think he was trying to champion the idealistic anti-establishment soul from the early net despite it essentially being an anachronism, even in 2005.
The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning. But that alone is a bit misleading if you need a massive model to fine-tune and another one to piece together the small amount of data.

I've seen the textbook analogy used, but to me it's like a very knowledgeable person reading an advanced textbook to become an expert, then saying they're better than other very knowledgeable people because they read that textbook, and that everyone can start from scratch using it.

So there's nothing wrong with making a more efficient model from an existing one; the issue is concluding you don't need all the data that made the existing one possible in the first place. While that may be true, this is not how you prove it.

> The issue is that they claim that you don't need an extensive amount of data to do efficient reasoning.

they claim that efficient reasoning can be achieved by applying a small set of SFT samples. how that sample set is collected/filtered is irrelevant here. they just reported the fact that this is possible. this by itself is a new and interesting finding.

I completely agree with the point made here. Setting aside the research controversy around the paper, though, from an engineering practice perspective the methodology it presents offers the industry an effective approach to distill structural cognitive capabilities from advanced models and integrate them into less competent ones.

Moreover, I find the Less-Is-More Reasoning (LIMO) hypothesis particularly meaningful. It suggests that encoding the cognitive process doesn't require extensive data; instead, a small amount of data can elicit the model's capabilities. This hypothesis and observation, in my opinion, are highly significant and offer valuable insights, much more than the specific experiment itself.

I'd say that the critique points out that this "information from a previous model" itself needs tremendous amounts of data. Now, did we see any better generalization capabilities with all data counted?