by NitpickLawyer 305 days ago
> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads.

Yeah, ok. The research is interesting, warranted, but writing an article about it, and leading with the conclusions gathered from toy models and implying this generalises to production LLMs is useless.
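
For a sense of how small that configuration is, here is a rough parameter count. It is only a sketch: it assumes the standard GPT-2 block layout (fused QKV projection, 4x MLP, biases, two LayerNorms per block) and leaves out the embedding tables, because the quoted passage gives neither vocabulary size nor context length. The function names are made up for illustration, and the 12-layer, 768-dimension comparison point is the smallest released GPT-2.

```python
# Rough non-embedding parameter count for a GPT-2-style decoder stack.
# Assumption: standard GPT-2 block layout. Embeddings are excluded because
# the quoted config does not state vocabulary size or context length.
# Note: the number of attention heads does not change the parameter count.

def gpt2_block_params(d_model: int) -> int:
    attn = d_model * 3 * d_model + 3 * d_model   # fused Q, K, V projection (+ bias)
    attn += d_model * d_model + d_model          # attention output projection (+ bias)
    mlp = d_model * 4 * d_model + 4 * d_model    # MLP up-projection to 4 * d_model
    mlp += 4 * d_model * d_model + d_model       # MLP down-projection back to d_model
    layer_norms = 2 * (2 * d_model)              # two LayerNorms, weight + bias each
    return attn + mlp + layer_norms

def gpt2_non_embedding_params(n_layer: int, d_model: int) -> int:
    final_layer_norm = 2 * d_model
    return n_layer * gpt2_block_params(d_model) + final_layer_norm

toy = gpt2_non_embedding_params(n_layer=4, d_model=32)      # config quoted above
small = gpt2_non_embedding_params(n_layer=12, d_model=768)  # smallest released GPT-2
print(f"toy: {toy:,}  gpt2-small: {small:,}  ratio: {small // toy}x")
```

Under those assumptions the quoted model has on the order of fifty thousand non-embedding parameters, against roughly 85 million for the smallest released GPT-2.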

We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print: they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results.

Research in this area is good, and needed. Mainly to understand limitations, discover if there are any scale levels where "emergent" stuff appears and so on. But writing articles based on incipient research, based on tiny models is not worth the effort.

7 comments

Doing analysis on small models or small data is perfectly valid if the results extrapolate to large models. Which is why right now we're looking at new research papers that are still listing the same small datasets and comparing to the same small models that papers five years ago did.
I have nothing against researching this, I think it's important. My main issue is with articles choosing to grab a "conclusion" and imply it extrapolates to larger models, without any support for that. They are going for the catchy title first, fine-print be damned.
I was just at the KDD conference and the general consensus agreed with this paper. There was only one keynoter who just made the assumption that LLMs are associated with reasoning, which was jarring as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead.

The thing is, I think the current companies making LLMs are _not_ trying to be correct or right. They are just trying to hide it better. In the business future for AI the coding stuff that we focus on on HN - how AI can help/impact us - is just a sideline.

The huge-money business future of LLMs is to end consumers not creators and it is product and opinion placement and their path to that is to friendship. They want their assistant to be your friend, then your best friend, then your only friend, then your lover. If the last 15 years of social media has been about discord and polarisation to get engagement, the next 15 will be about friendship and love even though that leads to isolation.

None of this needs the model to grow strong reasoning skills. That's not where the real money is. And CoT - whilst super great - is just as effective if it's hiding better that it's giving you the wrong answer (by being more internally consistent) than if it's giving you a better answer?

"as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead"

Do you have a link to the video for that talk?

I don't think they were recorded. In fact, I don't think any of KDD gets recorded.

I think it was Dan Roth who talked about the challenges of reasoning from just adding more layers and it was Chris Manning who just quickly mentioned at the beginning of his talk that LLMs were well known for reasoning.

https://kdd2025.kdd.org/keynote-speakers/

> None of this needs the model to grow strong reasoning skills. That's not where the real money is

"And the world is more and more complex, and the administrations are less and less prepared"

(~~ Henry Kissinger)

> None of this needs the model to grow strong reasoning skills. That's not where the real money is.

I never thought about it like that, but it sounds plausible.

However, I feel like getting to this stage is even harder to get right compared to reasoning?

Aside from the <0.1% of severely mentally unwell people who already imagine themselves to be in relationships with AIs, I don't think a lot of normal people will form lingering attachments to them without solving the issue of permanence and memory.

They're currently essentially stateless. While that's surely enough for short-term attachment, I'm not seeing this becoming a bigger issue because of that glaring shortfall.

It'd be like being in a relationship with a person with dementia; that's not a happy state of being.

Honestly, I think this trend is severely overstated until LLMs can sufficiently emulate memories and shared experiences. And that's still fundamentally impossible, just like "real" reasoning with understanding.

So I disagree after thinking about it more - emulated reasoning will likely have a bigger revenue stream via B2E applications compared to emotional attachment in B2C...

(the top post on HN right now is announcing Claude lets you buy a 1M token context. Extrapolate a few years.

Generally, there is a push towards 'context engineering' and there is a lot of bleeding edge research in snapshotting large contexts in ways to get the next back-forth turn in the conversation to be fast etc. So optimisations are already being made.)

Not sure what all this is about, I somewhat regret taking a break from coding with LLMs to have it explained to me it's all a mirage and a secret and sloppy plan for getting me an automagic egirl or something. ;)
The point being made doesn’t impact people who can find utility from LLM output.

It’s only when you need to apply it to domains outside of code, or a domain where it needs to actually reason, that it becomes an issue.

What does actually reason mean? It's doing this complex anesthesiologist x crna x resident surgery scheduling thingy for ~60 surgeries a day for this one client. Looked a lot like LSAT logic games stuff scaled up to me, took me almost 20-30m to hand check. Is that reasoning?
Right? Oh this fairly novel solution to the problem I was having that works and is well tested. Oh throw it away.. sorry the model can't think of stuff..

Back to square one!!

Can you please share a few sessions? I want to get a better sense of what people have achieved with generic LLMs that is novel. (Emphasis on "generic", I think I can more readily imagine how specialized models for protein folding can lead to innovation)
As to general consensus, Hinton gave a recent talk, and he seemed adamant that neural networks (which LLMs are) really are doing reasoning. He gives his reasons for it. Is Hinton considered an outlier or?
A) Hinton is quite vocal about desiring to be an outsider/outlier as he says it is what lets him innovate.

B) He is also famous for his Doomerism, which often depends on machines doing "reasoning".

So...it's complicated, and we all suffer from confirmation bias.

This is sloppy, I was asking about scientific consensus from the perspective of the prior commenter as a conference-goer. I am not asking for opinions bordering on ad hominems of Hinton or any other scientist, please refrain from that style of misinformation.
I think Hinton uses terms like reasoning and creativity and consciousness in ways that are different from my own embeddings.

I recently had fun asking Gemini to compare how Wittgenstein and Chomsky would view calling a large transformer that was trained entirely on a synthetic 'language' (in my case symbols that encode user behaviour in an app) a 'language' or not. And then, for the killer blow, whether an LLM that is trained on Perl is a language model.

My point being that whilst Hinton is great and all, I don't think I can quite pin down his definitions of the precise words like reasoning etc. It's possible for people to have opposite meanings for the same words (Wittgenstein famously had two contradictory approaches in his lifetime). In the case of Hinton, I can't quite pin down how loosely or precisely he is using the terms.

A forward-only transformer like GPT can only do symbolic arithmetic to the depth of its layers, for example. And I don't think the solution is to add more layers.

Of course humans are entirely neuro and we somehow manage to 'reason'. So YMMV.

Link to the talk?
It was a Royal Institution public lecture, "Will AI outsmart human intelligence? - with 'Godfather of AI' Geoffrey Hinton", https://www.youtube.com/watch?v=IkdziSLYzHw

Ultimately I somewhat disagreed with some of Hinton's points in this talk, and after some thought I came up with specific reasons/doubts, and yet at the same time, his intuitive explanations helped shift my views somewhat as well.

Because model size is a trivial parameter, and not a new paradigm.

What you're saying is like, you can't extrapolate that long division works on 100 digit numbers because you only worked through it using 7 digit numbers and a few small polynomials.

Scale changes the performance of LLMs.

Sometimes, we go so far as to say there is "emergence" of qualitative differences. But really, this is not necessary (and not proven to actually occur).

What is true is that the performance of LLMs at OOD tasks changes with scale.

So no, it's not the same as solving a math problem.

> What is true is that the performance of LLMs at OOD tasks changes with scale.

If scaling alone guaranteed strong OOD generalization, we’d expect the largest models to consistently top OOD benchmarks but this isn’t the case. In practice, scaling primarily increases a model’s capacity to represent and exploit statistical relationships present in the training distribution. This reliably boosts in-distribution performance but yields limited gains on tasks that are distributionally distant from the training data, especially if the underlying dataset is unchanged. That’s why trillion parameter models trained on the same corpus may excel at tasks similar to those seen in training, but won’t necessarily show proportional improvements on genuinely novel OOD tasks.

If you scale the LLM, you have to scale the tasks.

Of course performance improves on the same tasks.

The researchers behind the submitted work chose a certain size and certain size problems, controlling everything. There is no reason to believe that their results won't generalize to larger or smaller models.

Of course, not for the input problems being held constant! That is a strawman.

Alas, not true. It would be easier to predict progress if so.
This is 100% how it doesn't work with LLMs.
The extrapolation doesn't work if the transformer is too shallow (too few layers) relative to sequence length, because of https://arxiv.org/abs/2503.03961 . A bunch of tasks become unfeasible when the layer count is too low, and 4 layers is way too low. I.e. linearly increasing the number of layers in a model can result in a superlinear increase in performance on tasks like reasoning.
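
One loose way to picture why the depth needed can grow with sequence length: if tracking state over n steps is done by combining adjacent steps pairwise, about log2(n) rounds are needed, and a fixed number of rounds runs out as n grows. The sketch below is only that analogy. It is not the construction from the linked paper, a "round" merely stands in for a layer, and the permutations and helper names are made up for illustration.

```python
from typing import List, Tuple

Perm = Tuple[int, ...]  # a permutation of 0..k-1, standing in for one state update

def compose(first: Perm, second: Perm) -> Perm:
    """Return the permutation that applies `first`, then `second`."""
    return tuple(second[i] for i in first)

def sequential_fold(steps: List[Perm]) -> Perm:
    """Reference answer: apply the steps one after another."""
    result = steps[0]
    for step in steps[1:]:
        result = compose(result, step)
    return result

def tree_reduce(steps: List[Perm]) -> Tuple[Perm, int]:
    """Combine adjacent steps pairwise; return the result and the rounds used."""
    level = list(steps)
    rounds = 0
    while len(level) > 1:
        nxt = [compose(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried into the next round
        level = nxt
        rounds += 1
    return level[0], rounds

swap, cycle = (1, 0, 2), (1, 2, 0)   # hypothetical state updates on 3 items
short_seq = [swap, cycle] * 8        # 16 steps
long_seq = [swap, cycle] * 32        # 64 steps

short_result, short_rounds = tree_reduce(short_seq)
long_result, long_rounds = tree_reduce(long_seq)
print(short_rounds, long_rounds)
```

In this picture a 4-round budget covers the 16-step sequence but not the 64-step one, which is the flavour of the "too shallow relative to sequence length" point.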
Aren't most major LLMs moving to an architecture where the model is made up of tons of smaller models?

There's a mountain of reasons why this makes sense from a cost perspective, and seemingly it does also for quality, too, as the newer models train substantially more cheaply and still outperform the older models.

Naively, this seems like it would be relevant.

> conclusions gathered from toy models and implying this generalises to production LLMs is useless

You are just trotting out the tired argument that model size magically fixes the issues, rather than just improves the mirage, and so nothing can be known about models with M parameters by studying models with N < M parameters.

Given enough parameters, a miraculous threshold is reached whereby LLMs switch from interpolating to extrapolating.

Sure!

That’s what has been seen in practice though. SOTA LLMs have been shown again and again to solve problems unseen in their data set; and despite their shortcomings they have become extremely useful for a wide variety of tasks.
Even a tiny model for, say, classifying hand-written digits, will correctly classify digits that didn't appear in its training data. (Otherwise it wouldn't be very useful.) That classification is interpolative; the hand-written digit lands in the space of the training data.

Every result is explainable as having come from training data. That's the null hypothesis.

The alternative hypothesis is that it's not explainable as having come from training data. That's a hard-to-believe, hard-to-prove negative.

You don't get anything out of any computational process that you didn't put in.

You actually do not classify digits that didn't appear, you classify different pictures of digits that DID appear.

Similarly, LLMs do not invent a new way of reasoning about problems or language. They do, however, apply these to unseen problems.

LLMs are one level of abstraction up, but it's a very interesting level of abstraction.

>you classify different pictures of digits that DID appear.

Are you implying models that classify hand-written digits don’t generalize and only work on training data?

No, that is false; a neural net trained on a decent set of handwritten digits will recognize a newly handwritten digit.

I'm saying that this is a strawman version of "not in the training data". The newly handwritten digit is squarely the same sort of stuff that is in the training data: an interpolation.

We are not surprised when we fit a curve to a bunch of points and then find points on the curve that are not exactly any of those points, but are located among the points.

Go too far outside of the cluster of points though and the curve is a hallucination.

This is the intuition behind interpolate vs extrapolate.
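
A minimal sketch of that curve-fitting intuition, using only the standard library. The choice of sin(x), the seven sample points on [0, 3], and the two query points are arbitrary assumptions for illustration; it says nothing about LLMs directly.

```python
import math

def lagrange_fit(xs, ys):
    """Return the unique polynomial through the points (xs[i], ys[i]) as a callable."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

# "Training data": seven samples of sin(x) on [0, 3].
xs = [0.5 * k for k in range(7)]
ys = [math.sin(x) for x in xs]
curve = lagrange_fit(xs, ys)

# A query among the training points vs. one far outside the cluster.
inside_error = abs(curve(1.25) - math.sin(1.25))
outside_error = abs(curve(9.0) - math.sin(9.0))
print(f"inside: {inside_error:.2e}  outside: {outside_error:.2e}")
```

The point at 1.25 was never sampled, yet the fit is close there; at 9.0 the same curve returns a value that is nowhere near anything a sine can produce.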

Mind linking any examples (or categories) of problems that are definitively not in pre training data but can still be solved by LLMs? Preferably something factual rather than creative, genuinely curious.

Dumb question but anything like this that’s written about on the internet will ultimately end up as training fodder, no?

How about the International Math Olympiad?

https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...

You're saying they don't use math textbooks and math forums to train LLMs, then?
The problems are not in textbooks. I’m curious what would count as an out of distribution problem for you. Only problems no one knows how to solve?
You can apply this same argument to humans, 99.999% of people will not be able to escape it.

In the case of the Math Olympiad, the students who take it grind hours a day for months on practice problems and past Olympiad problems.

> SOTA LLMs have been shown again and again to solve problems unseen in their data set

We have no idea what the training data is though, so you can't say that.

> and despite their shortcomings they have become extremely useful for a wide variety of tasks.

That seems like a separate question.

I have applied O3 pro on abandoned research of mine that was never published and lives in an intersection that is as entirely novel as it's uninteresting.

O3 pro (but not O3) was successfully able to apply reasoning and math to this domain in interesting ways, much like an expert researcher in these areas would.

Again, the field and the problem is with 100% certainty OOD of the data.

However, the techniques and reasoning methods are of course learned from data. But that's the point, right?

The paper is evaluating how well an LLM can handle novelty, and on the paper's terms you need to calculate or otherwise somehow deduce the degree or type of novelty rather than simply describing your never published research as novel.

I don't even know that this is possible without seeing the training data. Hence the difficulty in describing how good at "reasoning" O3 Pro is.

The most novel problem would presumably be something only a martian could understand, written in an alien language, the least novel problem would be a basic question taught in preschool like what color is the sky.

Your research falls somewhere between those extremes.

LLMs don't learn reasoning. At all. They are statistical language models. Nothing else. If they get math right it's because correct math is more statistically probable given the training data, it can't actually do math. This should be pretty clear from all the "how many Rs are there in strawberry" type examples.
I think it is worth writing about simply because it might get the (cost constrained) researcher’s work in front of someone who has the near-unlimited research budgets at one of the big AI companies.
The results from a smaller model are still viable if the paradigm is identical. Unless you believe that larger volumes of data lead to more (unexplained) emergent properties of the AI, i.e., if you think that a larger volume of training data somehow means the model develops actual reasoning skills, beyond the normal next-token prediction.

I do think that larger models will perform better, but not because they fundamentally work differently than the smaller models, and thus the idea behind TFA still stands (in my opinion).

>Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no one read the fine print, they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results

You're conflating two very different things. Training on synthetic data one time is very different than cyclically training models on their own data. It has nothing to do with model size.
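
A toy picture of the cyclic case, as a sketch only: the "model" here is a Gaussian fitted to ten samples, each generation is fitted purely to the previous generation's samples with no real data mixed back in, and the sample size, generation count and function name are assumptions picked for illustration. It is an analogy for the feedback loop, not a claim about any particular LLM.

```python
import random
import statistics

def recursive_refit(generations: int, n_samples: int, seed: int = 0):
    """Fit a Gaussian, sample from the fit, refit on those samples, and repeat.

    Returns the fitted variance at each generation.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # the only real data
    variances = []
    for _ in range(generations):
        mu = statistics.fmean(data)
        var = statistics.pvariance(data, mu)   # maximum-likelihood variance
        variances.append(var)
        # The next generation sees only samples drawn from the current fit.
        data = [rng.gauss(mu, var ** 0.5) for _ in range(n_samples)]
    return variances

variances = recursive_refit(generations=200, n_samples=10)
print(f"first fit: {variances[0]:.3g}  after 1 round: {variances[1]:.3g}  "
      f"after 199 rounds: {variances[-1]:.3g}")
```

One round of fitting on synthetic samples moves the estimate a little; repeating the loop with nothing anchoring it lets the spread drift towards zero.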

Perhaps I worded it poorly. My main point was that articles focus on the wrong thing. Most coverage of that paper was "Using LLM generated data leads to CATASTROPHIC collapse", without reading the fine print.

> [...] cyclically training models on their own data. It has nothing to do with model size.

Of course it does. GRPO is basically "training models on their own data". You sample, you check for a known truth, you adapt the weights. Repeat. And before GRPO there was RLAIF which showed improving scores at 3 "stages" of generate - select - re-train. With diminishing returns after 3 stages, but no catastrophic collapse.
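
The "sample, check, adapt" loop can be sketched as below. This covers only the scoring half: rewards from a check against a known answer, turned into group-relative advantages by normalising within the sampled group. The actual weight update (clipped policy-gradient objective, KL term) is left out, and the sample strings and the `verify` helper are hypothetical.

```python
import statistics
from typing import List

def verify(answer: str, known_truth: str) -> float:
    """Hypothetical checker: reward 1.0 for an exact match, else 0.0."""
    return 1.0 if answer.strip() == known_truth else 0.0

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Pretend these are four completions sampled from the model for one prompt.
samples = ["42", "41", "42 ", "7"]
rewards = [verify(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)

for s, r, a in zip(samples, rewards, advantages):
    print(f"{s!r:6} reward={r} advantage={a:+.2f}")
```

Completions that pass the check get a positive advantage and the rest a negative one, so the update pushes towards the verified samples rather than towards everything the model emitted.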

My main point was about articles and cherrypicking catchy phrases, not criticising research. We need the research. But we also need good articles that aren't written just for the "negativity sells" titles.

cheeky edit: see this thread [1]. I know slashdot has fallen a lot in the last years, but I skimmed the root comments. Not one addressing the "toy" model problem. Everyone reads the title, and reinforces their own biases. That's the main problem I was trying to address.

1 - https://slashdot.org/story/25/08/11/2253229/llms-simulated-r...

If you have a ground truth that you're comparing to, that's not training on your own data.
"Training on synthetic data one time is very different than cyclically training models on their own data.", but everyone with even a modicum of understanding of feedback knows that cyclic training on its own output will end in tears; it's bordering on a tautologic inverse.
Is there an actual general principle or theorem or anything that you can link on this? I’m skeptical because these “model collapse” ideas sound vaguely technical and intuitive, but mostly seem to be based on observations about things that happened to happen with current LLMs. It gets bandied about like it is the most obvious thing, but the support mostly seems to be… pseudo-technical vibes.
Almost every mention I've seen of gpt-oss was a complaint that the training on synthetic datasets produced a model that's mostly good at benchmarks. Are benchmarks the great results you're referring to or are there a lot of satisfied users out there that just don't post here on HN? Genuinely curious.

I can see how performing well on benchmarks at the expense of everything else counts as great results if that's the point of the model.

Well now they could use GPT-OSS, but it wasn't out when they began the study.

I've recently been taking a look at another paper, from 2023, and subsequent research. It has a morally similar finding, though not focused on "reasoning traces", but it's based on GPT-4:

https://proceedings.neurips.cc/paper_files/paper/2023/hash/d...