Hacker News
by zamadatix 146 days ago
The price/scaling of training another same class model always seems to be dropping through the floor but training models which score much better seems to be hitting a brick wall.

E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
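
For anyone who wants to check the 70% figure, here is a minimal sketch. It assumes the LMArena scores can be read as standard Elo ratings on the usual 400-point logistic scale; the function name `elo_win_probability` is just an illustrative helper, not anything LMArena publishes.

```python
def elo_win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: the leaderboard scores behave like Elo ratings on a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Scores quoted above: gemini-3-pro (1488) vs gpt-4o-2024-05-13 (1346).
p = elo_win_probability(1488, 1346)
print(f"{p:.1%}")  # about 69.4%, i.e. roughly the 70% quoted
```

A 142-point gap therefore corresponds to winning roughly 7 head-to-head votes out of 10, and equal ratings give exactly 50%.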

The exception seems to be net new benchmarks/benchmark versions. These start out low and then either quickly get saturated or hit a similar wall after a while.

5 comments

> E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

Why do you care about LM Arena? It has so many problems, and the fact that no one would suggest using GPT-4o for doing math or coding right now, or much of anything, should tell you that a 'win rate of 70%' does not mean whatever it looks like it means. (Does GPT-4o solve roughly as many Erdos questions as gemini-3-pro...? Can you write roughly as good poetry?)

It'd certainly be odd if people were recommending old LLMs which score worse, even if marginally. That said, 4o is really a lot more usable than you're making it out to be.

The particular benchmark in the example is fungible but you have to pick something to make a representative example. No matter which you pick someone always has a reason "oh, it's not THAT benchmark you should look at". The benchmarks from the charts in the post exhibit the same as described above.

If someone was making new LLMs which were consistently solving Erdos problems at rapidly increasing rates then they'd be showing how it does that rather than showing how it scores the same or slightly better on benchmarks. Instead, the progress looks more like this: years passed between when we were first surprised that LLMs were writing poetry and when one was massaged into answering a single Erdos problem, once. Maybe by the end of the year, a few. The progress has definitely become very linear and relatively flat compared to roughly the initial 4o release. I'm just hoping that's a temporary thing rather than a sign it'll get even flatter.

Progress has not become linear. We've just hit the limits of what we can measure and explain easily.

One year ago coding agents could barely do decent auto-complete.

Now they can write whole applications.

That's much more difficult to show than an Elo score based on how people like emojis and bold text in their chat responses.

Don't forget Llama 4 led LMArena and turned out to be very weak.

You are understating past performance as much as you are overstating current performance.

One year ago I already ran qwen2.5-coder 7B locally for pretty decent autocomplete. And I still use it today as I haven't found anything better, having tried plenty of alternatives.

Today I let LLM agents write probably 60-80% of the code, but I frequently have to steer and correct them, and that final 20% still takes 80% of the time.

Much of these gains can be attributed to better tooling and harnesses around the models. Yes, the models also had to be retrained to work with the new tooling, but that doesn’t mean there was a step change in their general “intelligence” or capabilities. And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, having blindness to what is present, getting into loops, failing to follow simple instructions…

> Much of these gains can be attributed to better tooling and harnesses around the models.

This isn't the case.

Take Claude Code and use it with Haiku, Sonnet and Opus. There's a huge difference in the capabilities of the models.

> And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, having blindness to what is present, getting into loops, failing to follow simple instructions…

I don't know what frontier models you are using but Opus and Codex 5.2 don't ever do these things for me.

Frankly, this reads as a lot of words that amount to an excuse for using only LMArena, and the rationale is quite clear: it’s for an unrelated argument that isn’t going to ring true to people, especially an audience of programmers who just spent the last year watching the AI go from being able to make coherent file edits to doing multi-hour work.

LMArena is, de facto, a sycophancy and Markdown usage detector.

Two others you can trust, off the top of my head, are LiveBench.ai and Artificial Analysis. Or even Humanity’s Last Exam results. (Though, frankly, I’m a bit suspicious of them. Can’t put my finger on why. It was just a rather rapid hill climb for a private benchmark over the last year.)

FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

I've always found LiveBench a bit confusing to try to compare over time as the dataset isn't meant to be compared over time. It also currently claims GPT-5 Mini High from last summer is within ~15% of Claude 4.5 Opus Thinking High Effort in the average, but I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up (or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either). Artificial Analysis at least has the same at 20% from the top, so maybe that's the one we all agree to use for now since it implies faster growth.

> FWIW GPT 5.2 unofficial marketing includes the Erdos thing you say isn’t happening.

Certainly not, unless you're about to tell me I can pop into ChatGPT and pop out Erdos proofs regularly since #728 was massaged out with multiple prompts and external tooling a few weeks ago - which is what I was writing about. It was great, it was exciting, but it's exactly the slow growth I'm talking about.

I like using LLMs, I use them regularly, and I'm hoping they continue to get better for a long time... but this is in no way the GPT 3 -> 3.5 -> 4 era of mind-boggling growth of frontier models anymore. At best, people are finding out how to attach various tooling to the models to eke more out as the models themselves very slowly improve.

> I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up

Appstore releases were roughly linear until July 25 and are up 60% since then:

https://www.coatue.com/c/takes/chart-of-the-day-2026-01-22

I never claimed people don't make apps with AI. Of course they do - I can do that in a few clicks and some time with most any provider. You've been able to do that for a few years now, and that (linear) trend line starts over a year ago.

I can guarantee that if you restricted yourself to just that 60%, you wouldn't be pushing back on my doubt that AI apps are already the amazing things people are supposed to be so excited about using.

One of the best surgically executed nukes on HN in my 16 years here.
See peer reply re: yes, your self-chosen benchmark has been reached.

Generally, I've learned to warn myself off of a take when I start writing emotionally charged stuff like [1]. The same goes for bringing things up without any prompting (who mentioned apps? and why would you without checking?), and for reading minds and assigning weak arguments to people, both now and in my imagination of the future. [2]

At the very least, [2] is a signal to let the keyboard have a rest, and ideally my mind.

Bailey: > "If [there were] new LLMs...consistently solving Erdos problems at rapidly increasing rates then they'd be showing...that"

Motte: > "I can['t] pop into ChatGPT and pop out Erdos proofs regularly"

No less than Terence Tao, a month ago, pointing out your bailey was newly happening with the latest generation: https://mathstodon.xyz/@tao/115788262274999408. Not sure how you only saw one Erdos problem.

[1] "I'll wait with bated breath for the millions of amazing apps which couldn't be coded before to start showing up"

[2] "...or, more likely, be told in 6 months how these 2 benchmarks weren't the ones that should matter either"

I'm going to stick to the stuff around Tao, as even well tempered discussion about the rest would be against the guidelines anyways.

I had a very different read of Tao's post last month. As I read it, he opens by noting that there have been many claims of novel solutions which turn out to be known solutions from publications buried for years, but he says nothing about a rapid increase in the rates, or even claims that mathematicians using LLMs are having most of the work done by them yet.

He speculates, and I assume correctly, that contamination is not the only reason. Indeed, we've seen at least 1 novel solution which couldn't have come from a low interest publication being in the training data alone. How many of the 3 examples at the top end up actually falling that way is not really something anyone can know, but I agree it should be safe to assume the answer will not be 0, or even if it was it would seem unreasonable to think it stayed that way. These solutions are coming out of systems of which the LLM is a part, very often with a mathematician still actually orchestrating.

None of these are just popping in a prompt and hoping for the answer, nor will you get an unknown solution from an LLM by going to ChatGPT 5.2 Pro and asking it without the rest of the story (and even then, you still will not get such a solution regularly, consistently, or at a massively higher rate than several months ago). They are multishot from experts with tools. Tao makes a very balanced note of this in reply to his main message:

> The nature of these contributions is rather nuanced; individually and collectively, they do not meet the hyped up goal of AI autonomously solving major mathematical open problems, but they also cannot all be dismissed as inconsequential trickery.

It's exciting, and helpful, but it's slow and he doesn't even think we're truly actually at "AI solves some Erdos problems" yet, let alone "AI solves Erdos problems regularly and at a rapidly increasing rate".

It's very sad there is so much gaming of metrics with LLMs.

If we wish to avoid everyone creating benchmarks for themselves, then instead of predetermined benchmarks (public ones allow gaming, while publicly scored private ones require blind trust) we could use gradient descent on sentences to find disagreements between models, and then present them to human domain experts.

At least it could be public without the possibility of leaking (since the model creators don't yet know of all possible disagreements between LLMs, or which ones will be selected for review by human experts).

>E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.

I think in that specific case that says more about LMArena than about the newer models. Remember that GPT 4o was so specifically loved by people that when GPT 5 replaced it there was lots of backlash against OpenAI.

One of the popular benchmarks right now is METR, which shows some real improvement with newer models like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (but that's hard to disentangle from people getting better with those tools, the scaffolding (Claude Code, Codex) getting better, and lots of other stuff). SWEBench is still not saturated (less than 75% I think).

> The exception seems to be net new benchmarks/benchmark versions.

How is this an exception? If a genius and a kindergarten student both take a test on adding two single-digit numbers, how is that result at all relevant? Even though adding single-digit numbers is in the class of possible tests.

We can only look at non-saturated tests.

It’s becoming clear that training a frontier model is a capex/infra problem. This problem involves data acquisition, compute, and salaries for the researchers familiar with the little nuances of training at this scale.

For the same class model, you can train on more or less the same commodity datasets. Over time these datasets become more efficient to train on as errata are removed and the data is cleaner. The cost of dataset acquisition can be amortized and sometimes drops to 0 as the dataset is open sourced.

Frontier models mean acquiring fresh datasets at unknown costs.

Training costs might be coming down, but costs for hardware that can run these models are still obscenely high and rising. We're still nowhere near a point where it's realistically feasible to run a home LLM that doesn't feel like it's suffering from severe brain damage.