by gertlabs 30 days ago
We've been really impressed with the performance of ~30B-parameter-class models and how close they are to the frontier from ~6-12 months ago, which raises the question: are the frontier labs really serving 10T-parameter models? Seems unlikely.

If these Gemini 3.5 numbers are accurate, then I'd wager GPT 5.5 and Opus 4.7 are a lot smaller than people have speculated, too. It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

Gemini 3.5 Flash is really smart in one-shot coding reasoning, btw. Near the frontier. But it doesn't do so well in long horizon agentic tasks with arbitrary tool availability. This is a common theme with Google models, and the opposite of what we see with Chinese models (start dumb, iterate consistently toward a smart solution).

Data at https://gertlabs.com/rankings

6 comments

We know from NVIDIA's public Vera Rubin inference engine marketing materials that the frontier lab models are ~1-2T total.

Mythos is an exception that's larger.

Elon says Opus is 5T (and I would expect he'd know)

> It's not that frontier labs can't create a 5T+ parameter model, but they don't have the data to optimize a model of that size.

They have plenty of data. They use very large amounts of verifiable synthetic data (lots of it in coding and math) to cover the gap.

Also the frontier labs are paying people to do tasks, tracking the trajectories and training on that. Most of the optimization is in RL based on these trajectories.

> Elon says Opus is 5T (and I would expect he'd know)

Even if he knew, why would anyone expect Elon not to lie about anything?

> They have plenty of data.

I don't think data is the problem either, but compute is: if you want to train your 5T-param model the way modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.
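
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that training a dense transformer costs about 6 * N * D FLOPs (N parameters, D training tokens). The per-GPU throughput and utilization figures are hypothetical placeholders, not measurements of any real hardware, and a sparse (MoE) model would use its active parameter count for N instead.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def training_flops(params: float, tokens_per_param: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    tokens = params * tokens_per_param
    return 6.0 * params * tokens


def gpu_years(flops: float, peak_flops_per_s: float, utilization: float) -> float:
    """Convert total FLOPs into GPU-years at a given sustained utilization."""
    seconds = flops / (peak_flops_per_s * utilization)
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    n_params = 5e12          # 5T parameters, treated as dense (assumption)
    tokens_per_param = 1000  # the "thousand times more tokens than params" ratio
    total = training_flops(n_params, tokens_per_param)

    # Hypothetical accelerator: 1e15 FLOP/s peak at 40% utilization.
    years = gpu_years(total, peak_flops_per_s=1e15, utilization=0.4)
    print(f"total training compute: {total:.2e} FLOPs")
    print(f"single-GPU equivalent:  {years:.2e} GPU-years")
```

Dividing the GPU-years figure by a cluster size gives a rough wall-clock estimate, which is where the "enormous" part shows up.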

> if you want to train your 5T-param model the way modern small models are being trained (with a thousand times more training tokens than params), that's an enormous training run.

Yes it is. Spending $100M on training runs is common, and $1B might be in scope for some of the large models.

Sonnet 3.5 cost "a few 10s of millions of dollars" back in 2024: https://simonwillison.net/2025/Jan/29/on-deepseek-and-export...

I mean in general I'm pretty doubtful about things he says, but in this he was comparing Grok and it sort of makes sense in the context: https://x.com/elonmusk/status/2042123561666855235

In that context specifically, why would you trust him not to lie?

He's using a massive number for Opus to make Grok look good “for its size”.

If he had said something praising Anthropic, like “Grok is 7T, while Opus is better while being only 5T, we need to work harder”, then maybe I could believe it. But in this context he has every incentive to inflate Opus' size to make himself look somehow “in the race” when he really isn't, despite the money and compute advantage.

Given this tweet I wouldn't be surprised if Grok were actually 1T, with Opus in the same ballpark.

And I'm absolutely not buying current-day Sonnet being a 1T-parameter model (that's an absolutely deranged take: that would make Anthropic already behind Chinese model makers, which I think isn't something anyone would put money on).

This is what we do at gertlabs.com - the foundation labs are actually starving for better data. Having quality data is not the same as having a lot of data. Human curated data / RLHF cannot scale to a 5T model and synthetic data pipelines are very much a work in progress in the industry.

Some interesting notes:

- Training a small model with large model output resulted in LESS improvement than distilling a less smart model onto the same small architecture [0]. We are starting to hit intelligence density limits in small models (<30B models may be nearing saturation now)

- Good RL environments incidentally also make for good benchmarking

[0] https://arxiv.org/html/2502.12143v1

Wouldn’t it be worth investigating a micro-model architecture? For example, a first model checks the context and routes to a Java-optimized model, and so on. That would also make it simpler to load and unload models in memory.

So, extremely small models that are each only good at a certain task, such as a single programming language: a small model at the front that is extremely good at classifying tasks, and then a more complex model that can bring the outputs of these micro models back together.
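
A toy sketch of that routing idea, just to make it concrete. Everything here is hypothetical: the keyword table stands in for the small front classifier, the strings stand in for loaded model weights, and the LRU pool is one assumed way to handle loading and unloading, not how any lab actually does it.

```python
from collections import OrderedDict

# Hypothetical keyword table standing in for a learned task classifier.
KEYWORDS = {
    "java": ("public class", "System.out", "import java."),
    "python": ("def ", "import numpy", "print("),
}


def classify(prompt: str) -> str:
    """Pick the specialist with the most keyword hits, else fall back to 'general'."""
    scores = {name: sum(k in prompt for k in kws) for name, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"


class ModelPool:
    """Keeps at most `capacity` specialists in memory, evicting the least recently used."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.loaded = OrderedDict()

    def get(self, name: str) -> str:
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
        else:
            if len(self.loaded) >= self.capacity:
                self.loaded.popitem(last=False)  # unload the coldest model
            self.loaded[name] = f"<{name} model>"  # placeholder for a real load
        return self.loaded[name]


def route(prompt: str, pool: ModelPool):
    """Classify the prompt, then fetch (loading if needed) the matching specialist."""
    name = classify(prompt)
    return name, pool.get(name)


if __name__ == "__main__":
    pool = ModelPool(capacity=1)
    print(route("public class Foo { }", pool))
    print(route("def main(): pass", pool))
    print(list(pool.loaded))
```

The sketch leaves out the hard part raised in the replies: whether a specialist trained mostly on one language can actually be much smaller than a general coding model.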

My guess is that we underestimate how much non-Java data and context in general is needed to create a good Java coding model. It could be true that a good Java model would be 80-90% of the size of a comparable overall coding model.

Obviously, I have no idea but I guess it’s not as simple as “just train only on Java code and reduce size to 1/10th”.

I think you're describing Mixture-of-Experts.

> they don't have the data to optimize a model of that size.

So where does humanity cap out? The statement more or less implies that there's a ceiling on our ability to train models, which might be below what LLMs are capable of (not AGI, but, for example, how good coding agents might ever become).

I’m not sure if synthetic data is enough.

xAI paying Cursor to train models with their data tells us that having an agent tool like Claude Code is important for quality data acquisition. That’s why they recently shipped Grok Build.

I think we will see insane SOTA models from xAI in the next few months.

I agree with this sentiment but the reasoned anecdotes do not agree. I imagine the flagship models have modalities/usages that we hn-ers don't imagine easily.

It was estimated that Mythos is 10T.

And serving is not training. To distill, you first need to train the big model so that there is something to distill from.

Wouldn’t that be an exciting plot twist? That the release cadence of the big labs doesn’t actually reflect any meaningful improvements, or bigger models, but it’s a marketing ploy to start ratcheting up prices for good ARR numbers prior to the big IPO where the celebrity executives bail out of the stalling plane.

I exclusively use Gemini models and this has been my experience.

I mitigate it by creating dense planning docs for everything and executing iteratively.

Lots of time wasted on procedure, unfortunately.