Hacker News
by woeirua 89 days ago
This study's cutoff date was August 2025. I don't think this result is surprising given the level of coding agent ability back then. The whole thing just shows how out-of-date academic publishing is on this subject.

>This yields 806 repositories with adoption dates between January 2024 and March 2025 that are still available on GitHub at the time of data analysis (August 2025).

There were very few people who thought that coding agents worked very well back then. I was not one of them, but I _do_ think they work today.

3 comments

Evergreen excuses for tech that people desperately want to work. I get why: it would give you agency to do the other things you WANT to do. I tried reviewing a colleague's agent-generated code and it was practically unreviewable. I watched him blame himself, saying he just needed to adjust a parameter. He tried everything except admitting that the machine does not conceptually understand what he was asking.
we're one more rl run from Codex not trying to satisfy the type checker by replacing an index accessor with a BFS of all the keys in the API response and matching the correct property via regex.

one more, i swear.

Eh, I don't know. I mean, are we seeing better models now? Of course. But are they truly leaps and bounds better? No, and I get confused by people saying that they are. They're better but not like... 10x better.

And when people were studying ChatGPT 3.5, everyone would go "Oh, but that wasn't 4!", and when people talk about Opus 4.5 they go "4.6 is so much better!".

My personal position right now is that people are extremely bad at evaluating model output and changes in model capabilities. Model benchmarks do not support the position that models are 10x better than they were a year ago, but from how people discuss them you'd think that 10x was underselling it.

Every single objective metric that we have access to seems to suggest that they are, and the zeitgeist online seems to suggest that they are, but you're right. Your personal experience trumps all of that.
I don't have personal experience, but there seems to be a broad consensus that Opus 4.5 was the tipping point between "kinda bad" and "actually kinda useful".

So a cutoff point of August 2025, just before that, is a bit unfortunate (I'm sure there'll be newer studies soon).

That reaction has happened with every model release for the past few years. Maybe they aren’t the same people, but it’s always “old model was terrible, new model gets it right” then “new model was terrible, newer model gets it right,” ad infinitum.
A large proportion of my professional network was in the "AI for code generation might just be a fad" camp pre-Opus 4.5 (and the Codex/Gemini models that came out shortly after that), and now almost everyone seems to think that AI will have at least some place in professional development environments on an ongoing basis.

I've recently given it a go myself, and it certainly doesn't get it right all the time. But I was able to generate AI-assisted code that met my quality standards at roughly the same speed as coding it by hand.

FWIW I am definitely someone who uses AI. I have been using it for a few years now. There's no question that models have improved. I'd say the biggest leap was around ChatGPT 3.5 -> 4.0, which radically reduced hallucination problems. The big issue of "it just made up a module that doesn't exist" more or less went away at that point. This was the big leap from "spits out text that might help you" to "can produce value".

Since then it has been incremental. I would say the big win has been that models degrade more slowly as context grows. This means, especially for heavily vibecoded-from-scratch projects, that you hit the "I don't even know wtf this is anymore" wall way later, maybe never if you're steering things properly.

I think because you can avoid hitting that wall for longer, people see this as radically different. It's debatable whether that's true or not. But in terms of just what the model does, like how it responds to prompts, I genuinely think it is only marginally better. And again, I think benchmarks confirm this, and I quite like Fodor's analysis on benchmarking here[0].

I use these models daily and I try new models out. I think that when people switch over to a new model, they overinterpret "model did something different" or "it got it right" as "this is radically better", which I believe is simply a result of cognitive bias / poor measurement.

[0] https://jamesfodor.com/2025/06/22/line-goes-up-large-languag...

I have experience and the gap is exaggerated imo. edit: And I think benchmarks largely support this, and benchmarks are already biased to overstate LLM performance IMO.
I used Claude Code before August 2025 and it was definitely usable, although clearly more capable now. The difference is noticeable but not a completely different world, all in all, in my eyes.

I notice on a daily basis even now that it can easily lead to bloat and unnecessary complexity. We will see if it can be fixed by using even stronger models or not.

This is the perennial excuse, and I'm sure we'll continue to see it. Folks will say the exact same thing when the current crop of slop-generators has been replaced by a newer ilk.