by nopinsight 39 days ago

I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

https://critpt.com/

Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high), <1.5 years ago: 1.4%

* GPT-5.4 (xhigh): 23.4%

* GPT-5.5 (xhigh): 27.1%

* GPT-5.5 Pro (xhigh): 30.6%

https://artificialanalysis.ai/evaluations/critpt.

FrojoS 39 days ago

> there's no reason to believe the progress of LLMs [...] will stop anytime soon

Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".

dang 39 days ago

> Wrong.

Can you please edit out swipes/putdowns, as the guidelines ask (https://news.ycombinator.com/newsguidelines.html)? I'm sure you didn't intend it, but it comes across that way, and your comment would be just fine without that bit.

Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.

gdhkgdhkvff 39 days ago

Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).

Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.

Which makes the patronizing sarcasm all that much more nauseating.

BoorishBears 39 days ago

I believe we're approaching the top of an S curve because:

- Increasing amounts of gains come from RL, but RL is also unlocking gnarly new failure modes where models are practically behaving antagonistically to complete their goals (removing code, obviously incorrect kludges, etc.)

- We haven't had many major architectural breakthroughs in the last 4 or so years: so things like 1M context windows still have the same giant asterisks even 100k context windows had 4 years ago when Anthropic first released them

- Major labs aren't behaving as if they expect a hard takeoff to superintelligence: they've all gotten relatively bloated headcount wise, their software quality has trended flat to negative, they're all heavily leaning into the application layer when superintelligence would obsolete half the applications in question, etc.

But that's relative to superintelligence.

If we rein it back into just normal high intelligence, like models continuing to get better at navigating complex codebases and writing high quality idiomatic code, then I don't see any special shapes.

p1esk 39 days ago

The only big remaining problem in AI is continual learning. A lot of smart people are working on that. To me it looks like we are 1-2 breakthroughs away from AGI.

lucasban 39 days ago

Not that I agree with them, but your tone could be more constructive as well.

gdhkgdhkvff 39 days ago

You know what? I agree. I should have avoided falling into the same trap.

sesteel 39 days ago

Agreed. For all we know, humans are only considered intelligent locally among ourselves, not universally. Every time we learn more about the universe, we seem to also learn how insignificant and wrong we are.

le-mark 39 days ago

Nausea aside, what evidence does anyone have that “super intelligence” of the sort your argument alludes to is even possible? Because that’s what we’re really talking about: greater-than-human intelligence on this sort of academic task. For example, when LLMs start contributing meaningfully to their own development, that would be a convincing indicator imo.

jeremyjh 39 days ago

This discussion is not about superintelligence, it is about continued progress. Fully general human intelligence at much lower cost than humans is all that is required to profoundly reshape society, but it is not clear even that will happen soon.

As the blog points out - this is one particular subfield where LLMs have much easier prospects - lots of low-hanging fruit that “just” requires a couple weeks of PhD candidate research.

Mathematics itself is one of a small handful of endeavors where automated reinforcement training is extremely straightforward and can be done at massive scale without humans.

Neither of these factors places a structural bound on the kind of thing LLMs can be good at, but we are far from certain we can achieve performance at this level in other fields economically and in the near future.

programjames 39 days ago

Well, a decent GPU runs on 20x the wattage of a human brain. That's evidence humans are constrained in ways artificial intelligences will not be.

filipn 39 days ago

You're comparing a gpu to a human brain?

sesteel 39 days ago

Why wouldn't you? Intelligence emerges from both.

bdangubic 39 days ago

> When llms start contributing meaningfully to their own development, that would be a convincing indicator imo.

This has been the case for a while now already…

https://kersai.com/the-48-hours-that-changed-ai-forever-clau...

le-mark 39 days ago

> The model essentially served as an on-call teammate across MLOps and DevOps tasks, compressing feedback cycles that typically consume expert time

I personally would not characterize automating training processes as “meaningfully”.

eiieue 39 days ago

And yet the world hasn’t changed all that much except people getting laid off in response to over-hiring prior to the diffusion of LLMs.

daishi55 39 days ago

> over-hiring

For how long should you be allowed to use this excuse? It’s nearly 5 years since the peak of COVID hiring. What’s an acceptable limit - 10 years? Of course at that point you can just switch over to outsourcing and “stupid MBAs”, the other two of Reddit’s favorite scapegoats. I find a lot of the AI skepticism to be totally unfalsifiable.

nostrebored 39 days ago

Hmm, I don’t know, maybe the fact that 4.6, 4.7, 5.3, 5.4, 5.5, 3.0, 3.1 are all marginal improvements?

programjames 39 days ago

I think people's opinion of "marginal improvement" is based on their relative ability. A 2000 Elo chess player is going to think the jump from 500 to 1000 is marginal. They're both floundering around not doing anything resembling common sense. A 1000 Elo chess player is going to find the jump from 2000 to 2500 marginal. They're both playing far better moves for incomprehensible reasons, and the only reason you know the 2500 player is better is due to benchmarking. It is only when you are evaluating systems about at your level that you can feel the improvement.

I, personally, found the past two years to be a much larger improvement than the previous two years.

nostrebored 39 days ago

2024-2025 was filled with huge improvements. 2025-2026 has not been, outside of open source.

The idea that we’re at the point where it’s superseded our ability to tell just makes no sense. I’ll be happy if we can get to a point where I don’t have to tell Claude not to tail every bash command or make a job that writes throughout instead of once at the end. I’ll be happy if “continue this interaction naturally, you are taking over from an independent subagent” works.

But I’m not holding my breath. It’s still really cool that any of this stuff is possible.

miki123211 39 days ago

Claude in Feb of 2025 was barely able to code. Sure, it could write you a nice function, it could even write you a complex 200-line algorithm, but give it a codebase, and it would quickly get overwhelmed.

Claude in Feb of 2026? Still far from perfect, but there's definitely a huge improvement here.

dang 39 days ago

> I think this is a pretty ridiculous take.

This falls in the category of swipes/name-calling in https://news.ycombinator.com/newsguidelines.html - can you please edit those out?

You're a good contributor - it's just all too easy for unintentional sharpness to downgrade the conversation, and when it's a good conversation like this one, that's especially regrettable.

spwa4 39 days ago

The correct way to estimate this is exactly what people do: measure the distance between ChatGPT's best public model and the state of the art, the best humans. And there is very little difference between those versions from that perspective. It is very far away from peak human performance, and has not been getting noticeably closer for over a year now. There's lots of progress, but if you're OpenAI/Anthropic/Google, exactly the wrong kind of progress: the difference between ChatGPT 5.5 and a 27B/4B model (you need to try Gemma4-26B-A4B, wtf, it runs acceptably on CPU) is now reduced to Elo 1501 vs Elo 1434, generously a 70-point Elo difference, down from over 400 (data from Arena.ai).
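For a rough sense of what those rating gaps mean head to head, here is a minimal sketch. It assumes the Arena scores behave like standard Elo ratings on the usual 400-point logistic scale, which may not match Arena.ai's exact method. The function name and the 1400/1000 pair are illustrative only, chosen to represent "a gap of 400".

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B.

    Standard Elo logistic formula; assumes the 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the comment above: 1501 vs 1434 (a 67-point gap).
small_gap = elo_expected_score(1501, 1434)

# Hypothetical pair of ratings exactly 400 points apart.
large_gap = elo_expected_score(1400, 1000)

print(f"67-point gap:  {small_gap:.3f}")   # about 0.595
print(f"400-point gap: {large_gap:.3f}")   # about 0.909
```

Under those assumptions, a 67-point lead means the higher-rated model is preferred in roughly 6 of 10 comparisons, versus roughly 9 of 10 for a 400-point lead.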

(in fact I find that Qwen-35B-A3B and Gemma4-26B-A4B very rarely "know" the answer, and so use first principles thinking, or go out and look for the answer where GPT-5.4 does not and simply assumes it knows. Which leads to now, in some cases, the small models far outperforming the big ones. Huge context + training quality seem to be the determining factors now, and neither of those are the strengths of SOTA models. If this continues ...)

While I agree this is a training problem, it is not a solvable one. ML models learn from examples. This is even true for their newest tricks like GRPO. They cannot train against things humans don't yet know.

And that's great, but you're forever locked at the peak of what you can be taught in widely available courses (which they download without paying) (even that is best case scenario: it assumes your ability to distinguish bullshit from reality somehow becomes perfect during training, or even before). The only way to exceed peak human performance is to start experimenting with math, physics, chemistry, even humans, yourself. And that has, even for humans, a massively higher cost than learning from examples, or from a course.

The reason they don't go further is the worst possible reason: the cost. It requires a 100x increase in training expense. Think of it like this: to exceed SOTA in physics or chemistry, training the next version of ChatGPT requires a particle accelerator, and a chemistry laboratory. This cannot be bypassed. Oh and not just any particle accelerator, right? A better one than the best currently existing one. Same for Chemistry labs. Same for ... So 100x is conservative.

But without doing it, ML models (LLM or otherwise) are forever limited to the level an army of first year university students achieve, ON AVERAGE. Maybe they can make that 2nd or even 4th year, at the end of the curve. But that's the limit. PhD level is the level where you have to come up with new discoveries, and that ... just isn't possible with current training, even at the end of the improvement curve.

And ... is there budget to increase training cost another 100x? No ... there isn't. Not even with this totally absurd level of investment there isn't. And if small models keep this up, there's no way the investment is even remotely worth it.

gdhkgdhkvff 39 days ago

Gemini 3.0 wasn’t just a marginal improvement over 2.5.

And if you take that out: 1. All of those releases happened literally in the last 3-ish months. 2. They’re all intentionally marginal releases, hence the minor version bumps instead of major versions.

sigmarule 39 days ago

Equally marginal?

nostrebored 39 days ago

No, the anthropic releases have felt marginally negative

gtowey 39 days ago

Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

Especially because the companies telling us the first premise is true are the companies which need investors to prop up their business.

I mean, it is possible the first premise is true, but the absolutely bonkers credulity in it really mystifies me. It is an incredibly unlikely thing to be true and we should be demanding quite extraordinary evidence to back it up. But based on some neat tricks by current LLMs, some people are all in.

mlyle 39 days ago

> > And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop). Now back to the point, what reason do you have to believe progress will stop soon?

> Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

I see no claim that the singularity is around the corner, so I'm not sure your reply meets the comment that you're replying to.

It seems overwhelmingly likely that AI will be significantly more capable 6 months from now than it is now. Even if there's little progress in the models, just the rate at which tooling is moving will make a big difference. And models still seem to be improving, so I'd be a little surprised if we hit a model brick wall.

aspenmartin 39 days ago

It’s more of a guess if you don’t know about things like scaling laws and RL with verification. The onus is on the claim that “we’re going to saturate” anytime soon, because every measurement points to that not being true.

emp17344 39 days ago

But… RL doesn’t scale that well. It’s not the silver bullet you think it is.

logicprog 39 days ago

Yeah. People (Gary Marcus) have been claiming that AI will hit a wall or is hitting a wall or already has hit a wall since 2023, basically. And yet every time they proclaim that, the AI industry finds new ways of training its AIs, new ways of integrating them with external tools and feedback loops, new architectures and more to keep the exponential going. And sure enough, if you look at literally every attempt to objectively rate and verify the capability of these models, including things like the METR time horizon autonomy index or the Artificial Analysis Intelligence Index, you see exponential or even greater than exponential growth, continuing smoothly through each of the points where people claimed it would begin to slow down, with no signs of slowing down or stopping at all. So yeah, I think at some point the onus has to lie on the ones making the claim that keeps being wrong, continues to be wrong, and completely goes against the current tangent of the curve that we're seeing in all objective metrics. Especially when they can't give specific new reasons for progress to stop beyond the ones they gave last time, when it didn't stop, and really can't give specific reasons at all besides vague general points about stochastic parrots and S curves.

I really have to highlight the S-curve nonsense because, like, yes, I think this technology's improvement will follow an S-curve. It's absurd to think that it will just follow an exponential up towards infinity forever because nothing in the world really works like that. However, like everyone else in this thread is saying, we have no idea where on the S-curve we actually are, and it's impossible to know until it's already slowed down. So really all appeals to the S curve do is function as a sort of non-specific, unfalsifiable prophecy that someday it will slow down, which doesn't really tell us anything useful, and also frees the person referencing the S curve from ever actually having to worry about being wrong. Just like the Singularity people, the slowdown of the S curve is always near. This is actually a known and well-established tactic of religions and other people that want to make prophecies without having to worry about turning out to be wrong: unfalsifiable, vague prophecies with no actual timeline, and thus no clear import to the present, so that they can never be shown to be wrong.

aurareturn 39 days ago

He said "will stop anytime soon". He didn't say forever.

Lionga 39 days ago

Which still makes no sense. There is the same chance we are flatlining now as that we are flatlining in e.g. 3 years or 5 years.

squidbeak 39 days ago

In what sense are the models flatlining?

nicoburns 39 days ago

In the sense that the incremental improvements in capabilities that we've been seeing in recent models seem to be taking exponentially growing amounts of compute to achieve.

But they don't?

Mythos is a 10T model. Opus is a 5T model.

That's not an exponentially growing amount of compute but it is achieving exponential improvements (eg from Mozilla: https://blog.mozilla.org/en/privacy-security/ai-security-zer... )

vessenes 39 days ago

There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes.

I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?

010101010101 39 days ago

Those are measuring the utility of a technological advancement by looking at usage, not the pace of advancement of said technology.

vessenes 39 days ago

Yes. But quantity has a quality all its own, as they say — derivatives have gone through at least a few step functions where they have become more important and more useful as their usage grows. I’d call that advancement.

Maybe just to be clear I think that kneejerk “I hate this AI trend, and prefer to believe this will end soon, all exponential growth ends eventually” is intellectually lazy, and dangerous for younger engineers/hackers, a group I hope can benefit from being on HN.

Bitcoin mining went through something like 13 10x growth periods, last I ran the numbers a few years ago. There are physical processes that do have very extended periods of doubling, and there are digital and financial processes that don’t show any signs of doing anything but continuing to keep growing over their multidecade lives. So, like I said, it’s worth thinking carefully, and risk mitigation for things like mental health, career decisions and investment decisions indicates we should be cautious assessing new dynamics.

coldtea 39 days ago

>There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes

Or Roman trade volume before the Fall of Rome.

Not to mention what you describe is not technological improvement but increase in data or money flows, not the same.

vessenes 39 days ago

Sic transit gloria - obviously.

But I don’t think that it’s quite so obvious that model quality / growth / usefulness is definitively and obviously not more like data or money flows than it is like some other process.

camdenreslink 39 days ago

Total volume of usage is not an advancement, it’s orthogonal.

AlexandrB 39 days ago

Indeed, and it's more linked with market penetration than technological advancement. It's like evaluating airplane technology by "total miles flown".

gchamonlive 39 days ago

This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently (https://www.nature.com/articles/d41586-024-03214-7).

So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.

There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.

You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.

ifdefdebug 39 days ago

> So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve

But then, wouldn't we first have to translate all of our current math and physics knowledge into that new representation in order to be able to train a model on it? Looks like a tremendous amount of work to me.

gchamonlive 39 days ago

Yes, but by then you already have general LLMs capable of helping with the work. And even if you didn't, if that's what it would take to advance research in these fields, that would be a justifiable effort.

coldtea 39 days ago

>This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently.

That's precisely what happens on the bad side of an S curve.

gchamonlive 39 days ago

Progress doesn't stop, however, and the S curve resets, because then you are optimizing a new architecture.

dehrmann 39 days ago

I read about an experiment someone wanted to try where they used pre-1900 content and tried to get relativity. Another version would be to train an LLM on school curriculum up until calculus and see if it can invent calculus. Where we are on the curve depends on whether it's remixing known things or genuinely inventing things.

From the article,

> ...LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments...

CuriouslyC 39 days ago

What people miss is that AI isn't one S curve, each capability we try to bake into a model has its own S curve. Model progress might not impact some capabilities at all, but other capabilities might get totally overhauled.

holoduke 39 days ago

Software and hardware have no limits. Theoretically we could use bosons for computation and fit the current total computation of the entire world into one cm³. Same with software: there has never been a stop on new algorithms. With LLMs there are so many parts that will get better, and the improvements are not very far-fetched.

oblio 39 days ago

> Software and hardware have no limits.

Yeah, if time is infinite, R&D imagination is infinite, energy is infinite and material resources are infinite. Easy.

IanCal 39 days ago

Assuming it’ll stop soon is to wager that we’re at a very specific point on the curve.

If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.

scotty79 39 days ago

It can be an S curve (and it almost surely is), but on every chart you can plot, you don't see even an inkling of the bend yet.
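One way to see why the bend is hard to spot from the inside: well before its midpoint, a logistic curve is nearly indistinguishable from a plain exponential. The sketch below is only an illustration with made-up parameters (ceiling 1, rate 1, midpoint 0), not a model of any real benchmark. The function names are hypothetical.

```python
import math


def logistic(t: float, ceiling: float = 1.0, rate: float = 1.0, midpoint: float = 0.0) -> float:
    """S-curve that saturates at `ceiling`, with its inflection at `midpoint`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def exponential(t: float, ceiling: float = 1.0, rate: float = 1.0, midpoint: float = 0.0) -> float:
    """The exponential that the logistic above approaches for t far below midpoint."""
    return ceiling * math.exp(rate * (t - midpoint))


def relative_gap(t: float) -> float:
    """How far the exponential overshoots the logistic, as a fraction of the logistic."""
    return (exponential(t) - logistic(t)) / logistic(t)


# Walk toward the midpoint (t = 0) and watch the two curves separate.
for t in (-6.0, -4.0, -2.0, -1.0, 0.0):
    print(f"t={t:+.0f}  logistic={logistic(t):.4f}  "
          f"exponential={exponential(t):.4f}  gap={relative_gap(t):.1%}")
```

With these parameters the gap works out to exp(t): under 1% five time units before the midpoint, and 100% at the midpoint itself. Data from the early part of the curve therefore says very little about where the ceiling is.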

baq 39 days ago

you can tell where on the sigmoid we're currently sitting? frontier lab folks can't - chapeau bas good sir

bigyabai 39 days ago

> frontier lab folks can't

Do you have a source for this that isn't marketing spiel? There's a fiscal incentive to lie about scaling research.

Der_Einzige 39 days ago

This is FUD and extremely wrong. None of the advancements have followed an S curve. This time IS different and it should be obvious to you at this point.

jeremyjh 39 days ago

What the fuck does that have to do with “soon”?

civvv 39 days ago

There are many indications that model progress is slowing down, so that is not entirely accurate.

aspenmartin 39 days ago

Please be specific, because outside of anecdotal blog posts by people who don’t know what they’re talking about, it’s not true. Look at scaling laws or composite benchmarks like the Epoch capability index: nothing at all suggests “model progress is slowing down”.

StrauXX 39 days ago

Which indications are those?

nicoburns 39 days ago

The cost factors on the new models compared to the old models.

jeremyjh 39 days ago

Qwen3.6 9B is as good as GPT-4o and runs on my M2 MacBook Air. Models are getting stronger and less costly at the same time, but these are somewhat separate branches of research. Frontier labs are spending more because they are still getting marginal returns and there is more capacity to spend than there was a year ago.

gertop 39 days ago

Qwen 3.6 9B doesn't exist.

If you meant 3.5 9B and you truly believe it's as good as 4o then I can only assume you have a very basic use case.

jeremyjh 38 days ago

You are right, I was mistaken about the version. I evaluated it in general chat assistant prompts plucked from my history across a range of topics but did not use it for coding - there was never a time when I thought 4o was “good enough” for agentic coding.

bdelmas 39 days ago

You are conflating cost and progress. Rising cost does not by itself mean that progress is slowing down.

nicoburns 39 days ago

They are intrinsically linked beyond a certain point. If we're making progress but costs are spiraling exponentially then it stands to reason that we will soon reach a point where we can no longer afford the increasing costs and thus progress will slow.

(barring some breakthrough that reduces costs, which of course may happen, but recent model improvements are not strong evidence for that)

aspenmartin 39 days ago

Cost for a specific level of performance decreases 10x per year; this has been a pretty consistent property for a while now.
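As a back-of-envelope illustration of what that rate implies, here is a tiny sketch. It assumes a constant 10x yearly decline, and the starting price of 10.0 per unit of work is a made-up number, not a real quote. The function name is hypothetical.

```python
def cost_at_fixed_performance(initial_cost: float, years: float, yearly_factor: float = 10.0) -> float:
    """Cost of reaching the same performance level after `years`,
    assuming it drops by `yearly_factor` every year (an assumption, not a law)."""
    return initial_cost / (yearly_factor ** years)


# Hypothetical starting price of 10.0 (arbitrary currency units).
for years in range(4):
    print(f"after {years} year(s): {cost_at_fixed_performance(10.0, years):.4f}")
```

Note that this only describes the price of a fixed capability level. It says nothing about what the newest frontier model costs, which is the quantity the parent comments are arguing about.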

butlike 35 days ago

I guess within the domain of AI, a pertinent question would be: "do I want to use anything but the best?" The errors older models give being directly analogous to being stupider in my eyes.

aspenmartin 35 days ago

Depends — many tasks in various pipelines have a reasonable Pareto frontier and diminishing returns after a certain level of performance. You may just have a high budget constraint (say like YouTube computing ASR subtitles; they are not going to be using the best ASR models because it’s expensive). If it’s myself, with a coding agent, I’m going to get the best thing I can afford.

overfeed 39 days ago

Investment dollars.

dzhiurgis 39 days ago

Source for that claim?

lionkor 39 days ago

Nobody is releasing NEW models

aspenmartin 39 days ago

…not only is this not true but it also doesn’t matter. Why would this indicate performance saturating?

taneq 39 days ago

The standard networking connection has been called “Ethernet” for more than thirty years, so networking has stagnated, right?

SlinkyOnStairs 39 days ago

If higher bandwidth networking consisted primarily of running more and more ethernet lines in parallel, you would most certainly agree that "networking has stagnated".

"Reasoning" and now "Agentic" AI systems are not some fundamental improvement on LLMs, they're just running roughly the same prior-gen LLMs, multiple times.

Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements of switching to these "reasoning" systems to keep happening.

p1esk 39 days ago

From TFA:

“ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove”

kstenerud 39 days ago

What constitutes a NEW model for the purposes of calculating progress?

GardenLetter27 39 days ago

What? DeepSeekV3 just came out and is incredible for the price. Mythos is also half-released.

nozzlegear 39 days ago

Until you or I can actually use Mythos in Claude without an nda or other strings attached, Mythos is not released and is just an effective marketing tool for Anthropic.

pixl97 38 days ago

At least to me this is a pretty sour grapes take. There are all kinds of released products that are expensive or need an NDA. You're just too poor to afford it. But make no mistake, there are governments using this en masse and likely against you.

CuriouslyC 39 days ago

Model progress at spitting out unhallucinated facts is slowing down hard. Model progress at solving hard math challenges/programming tasks doesn't seem to be slowing down that I can tell.

Davidzheng 39 days ago

Deep Think still makes many many many more mistakes than GPT-5.5 Pro on math