| HN Mirror

by staticassertion 90 days ago

Eh, I don't know. I mean, are we seeing better models now? Of course. But are they truly leaps and bounds better? No, and I get confused by people saying that they are. They're better but not like... 10x better.

And when people were studying ChatGPT 3.5, everyone would go "Oh, but that wasn't 4!", and when people talk about Opus 4.5 they go "4.6 is so much better!".

My personal position right now is that people are extremely bad at evaluating model output and changes in model capabilities. Model benchmarks do not support the position that models are 10x better than they were a year ago, but with how people discuss them you'd think that 10x was underselling it.

2 comments

woeirua 84 days ago

Every single objective metric that we have access to seems to suggest that they are, and the zeitgeist online seems to suggest that they are, but you're right. Your personal experience trumps all of that.

nicoburns 90 days ago

I don't have personal experience, but there seems to be a broad consensus that Opus 4.5 was the tipping point between "kinda bad" and "actually kinda useful".

So a cutoff point of August 2025 just before that is a bit unfortunate (I'm sure there'll be newer studies soon).

jdlshore 90 days ago

That reaction has happened with every model release for the past few years. Maybe they aren’t the same people, but it’s always “old model was terrible, new model gets it right” then “new model was terrible, newer model gets it right,” ad infinitum.

nicoburns 90 days ago

A large proportion of my professional network were in the "AI for code generation might just be a fad" camp pre-Opus 4.5 (and the Codex/Gemini models that came out shortly after that), and now almost everyone seems to think that AI will have at least some place in professional development environments on an ongoing basis.

I've recently given it a go myself, and it certainly doesn't get it right all the time. But I was able to generate AI-assisted code that met my quality standards at roughly the same speed as coding it by hand.

staticassertion 90 days ago

FWIW I am definitely someone who uses AI. I have been using it for a few years now. There's no question that models have improved. I'd say the biggest leap was around the ChatGPT 3.5 -> 4.0 transition, which radically reduced hallucination problems. The big issue of "it just made up a module that doesn't exist" more or less went away at that point. This was the big leap from "spits out text that might help you" to "can produce value".

Since then it has been incremental. I would say the big win has been that models degrade more slowly as context grows. This means, especially for heavily vibecoded-from-scratch projects, that you hit the "I don't even know wtf this is anymore" wall way later, maybe never if you're steering things properly.

I think because you can avoid hitting that wall for longer, people see this as radically different. It's debatable whether that's true or not. But in terms of just what the model does, like how it responds to prompts, I genuinely think it is only marginally better. And again, I think benchmarks confirm this, and I quite like Fodor's analysis on benchmarking here[0].

I use these models daily and I try new models out. I think that when people switch over to a new model, they overinterpret "model did something different" or "it got it right" as "this is radically better", which I believe is simply a result of cognitive bias / poor measurement.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...

staticassertion 90 days ago

I have experience and the gap is exaggerated imo. edit: And I think benchmarks largely support this, and benchmarks are already biased to overstate LLM performance IMO.

kaffekaka 89 days ago

I used Claude Code before August 2025 and it was definitely usable, although clearly more capable now. The difference is noticeable but not a completely different world, all in all, in my eyes.

I notice on a daily basis even now that it can easily lead to bloat and unnecessary complexity. We will see if it can be fixed by using even stronger models or not.
