| HN Mirror



	by achrono 57 days ago
	How do we know that today's frontier models are merely scaled up versions of that? Genuine question, since the labs have narrowed what they share over the years to now almost nothing, in terms of how the model was trained and how it works under the hood.

4 comments

HarHarVeryFunny 56 days ago

We know for sure the architecture of the open weights models since llama.cpp understands the architecture it needs to build to plug the weights into to run them. It's always possible that the latest closed model is doing something architecturally different than the open weights ones we know about, but judging by how close the large open weight models such as DeepSeek are to SOTA performance, this seems unlikely. When OpenAI first came out with their near-mythical "Strawberry" (aka "o1") thinking model there was all sorts of speculation that they had made some sort of architectural breakthrough, but then DeepSeek replicated the capability and published how they did it, proving that it was just better training, not any architectural change.

There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks such as various types of attention (some pioneered in the open by DeepSeek) that better scale to large context lengths, and the confusingly named "mixture of experts" architecture, but what's more notable really is how little the architecture has changed. The capability gains have been coming from better training and better data.
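A minimal sketch of the top-k routing idea behind "mixture of experts", to make the "efficiency tweak" point concrete: a router scores every expert, but only the k best-scoring experts actually run for a given input. The toy experts, the router scores, and all function names here are illustrative assumptions, not taken from any specific model.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts; renormalise their weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the k selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # experts that were not selected are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# Made-up router scores for one token; in a real model a learned layer produces these.
router_logits = [0.1, 2.0, -1.0, 1.5]
routes = top_k_route(router_logits, k=2)
output = moe_layer([1.0], experts, router_logits, k=2)
```

With k=2 out of 4 experts, only half of the expert parameters are touched per token, which is why the total parameter count can grow without the per-token compute growing with it.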

gobdovan 56 days ago

DeepSeek research:

- V3 https://arxiv.org/abs/2412.19437

- V2 https://arxiv.org/abs/2405.04434

- R1 https://arxiv.org/abs/2501.12948 (RL applied to ML models was well-known beforehand, but they show it in the open, at scale, on big models)

Then, there's the incentive analysis. If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare. I'm not saying there's no one cooking a new architecture, just that it is a pretty rare event. And it would have to come from some researchers that would be happy to not publish their findings, which is not really what a sizable portion of elite researchers (obviously not all) are incentivized to do.

Of course, it's a bit of a verbal compression to claim simply 'scaled up'. They are recognisably scaled-up transformers. Most new models come with a few tricks, but we're at the point where those usually are not an architectural rewrite; they are added to solve an explicit problem, like hallucination, not for big new capability gains.

swyx 56 days ago

> If you can see that these models empirically get better with scale, why would you swap the main architecture? Those events will be pretty rare

cf. the hardware lottery https://arxiv.org/abs/2009.06489

matusp 56 days ago

There are thousands of people working in top level labs. Somebody would leak it

ai_slop_hater 57 days ago

No, they are clearly not just scaled up versions of GPT-2; there are different LLM architectures like mixture of experts etc. that appeared relatively recently. I am not an expert though, far from it.

otabdeveloper4 56 days ago

MoE and such are basically performance enhancements, they don't make the model smarter.

yababa_y 56 days ago

Separately trained experts can surpass performance in their activated regime, and that DOES result in a smarter model. The Claude system cards talk about this, and e.g. there is https://openreview.net/forum?id=iydmH9boLb to read...

jmalicki 56 days ago

Performance enhancements are huge though.

If you can make the existing model faster, you can then save your inference budget to then make your model bigger, which then makes it smarter.

A lot of how smart the models can be comes down to budget. If you can make your existing thing cheaper, you can instead make it bigger for the same price.
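A back-of-the-envelope sketch of that trade, assuming the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. The 10B budget and the 25% active fraction are made-up numbers for illustration, and the function names are hypothetical.

```python
def flops_per_token(active_params):
    # Rough approximation: ~2 FLOPs per active parameter per token (forward pass only).
    return 2 * active_params

def max_total_params(flop_budget_per_token, active_fraction):
    """Largest total parameter count whose per-token forward cost fits the budget.

    active_fraction is the share of parameters used per token:
    1.0 for a dense model, smaller for a sparse one.
    """
    active_params = flop_budget_per_token / 2
    return active_params / active_fraction

# Budget equal to the per-token cost of a hypothetical 10B-parameter dense model.
budget = flops_per_token(10e9)

dense_total = max_total_params(budget, active_fraction=1.0)    # same 10B model
sparse_total = max_total_params(budget, active_fraction=0.25)  # 4x more total parameters
```

This only counts compute. Memory footprint and bandwidth still grow with the total parameter count, so the real trade is less clean than the arithmetic suggests.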

TheHalfDeafChef 56 days ago

Not really “smarter” though? It’s just a big probability engine.

(Not trying to flame bait or anything. I just wouldn’t describe an LLM as exhibiting intelligence. It is great at making connections based on probability but doesn’t have a semantic understanding of what it is doing.)

stevenhuang 55 days ago

You do realize modern neuroscience considers the human brain "just" a probability engine, and that intelligence may well be an organism's ability to predict well?

> doesn’t have a semantic understanding of what it is doing

I hope you realize this is an area of open, active research.

Chu4eeno 54 days ago

Didn't neuroscience have some big scandals about bad statistics and overstating their findings (in addition to normal issues like replication)? Look up at least the "dead salmon study" (hint: it's related to fMRI, and you can probably guess its conclusions from its nickname). The "Voodoo Correlations" and "Cluster Failure" papers are also a bit eye-opening.

In general we (humans) need to be humble about the limitations of our knowledge about how we function; it's an insanely complicated problem.

otabdeveloper4 56 days ago

> to then make your model bigger, which then makes it smarter

There are diminishing returns, and at some point making a model bigger makes it dumber.

lobocinza 54 days ago

Maybe due to lack of data and dimensions other than words.

fizx 56 days ago

Performance enhancements are what allow you to train a bigger model.
