| HN Mirror


by simonw 171 days ago

I see it brought up almost every week! It's a firm favorite of the "LLMs don't actually help write code" contingent, probably because there are very few other credible studies they can point to in support of their position.

Most people who cite it clearly didn't read as far as the table where METR themselves say:

> We do not provide evidence that:

> 1) AI systems do not currently speed up many or most software developers. Clarification: We do not claim that our developers or repositories represent a majority or plurality of software development work

> 2) AI systems do not speed up individuals or groups in domains other than software development. Clarification: We only study software development

> 3) AI systems in the near future will not speed up developers in our exact setting. Clarification: Progress is difficult to predict, and there has been substantial AI progress over the past five years [3]

> 4) There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting. Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...


fancyfredbot 171 days ago

Weird, you shouldn't really need to list the things your study doesn't prove! I guess they anticipated that the study might be misrepresented and wanted to get ahead of that.

Their study still shows something interesting, and quite surprising. But if you choose to extrapolate from this specific setting and say coding assistants don't work in general then that's not scientific and you need to be careful.

I think the study should probably decrease your prior that AI assistants actually speed up development, even if developers using AI tell you otherwise. The fact that it feels faster when it is slower is super interesting.

simonw 171 days ago

The lesson I took from the study is that developers are terrible at estimating their own productivity based on a new tool.

Being armed with that knowledge is useful when thinking about my own productivity, as I know that there's a risk of me over-estimating the impact of this stuff.

But then I look at https://github.com/simonw which currently lists 530 commits over 46 repositories for the month of December, which is the month I started using Opus 4.5 in Claude Code. That looks pretty credible to me!

pydry 171 days ago

The lesson I learned is that agentic coding uses intermittent reinforcement to mimic a slot machine.

It (along with the hundreds of billions in investments hinging on it) explains the legions of people online who passionately defend their "system". Every gambler has a "system", and they usually earnestly believe it is helping them.

Some people even write popular (and profitable!) blogs about playing slot machines where they share their tips and tricks.

logicprog 171 days ago

I really wish this meme would die.

We know LLMs follow instructions meaningfully and relatively consistently; we know they are in-context learners and also pull from their context window for knowledge; we also know that prompt phrasing and especially organization can have a large effect on their behavior in general; we know from first principles that you can improve the reliability of their results by putting them in a loop with compilers / linters / tests, because they do actually fix things when you tell them to. None of this is equivalent to a gambler's superstitions. It may not be perfectly effective, but neither are a million other systems and best practices and paradigms in software.

Also, it doesn't "use" anything. It may be a feature of the program but it isn't intentionally designed that way.

Also who sits around rerunning the same prompt over and over again to see if you get a different outcome like it's a slot machine? You just directly tell it to fix whatever was bad about the output and it does so. Sometimes initial outputs have a larger or smaller amount of bad, but still. It isn't really analogous to a slot machine.

Also, you talk as if the whole "do something -> might work / might not, stochastic to a degree, but also meaningfully directable -> dopamine rush if it does; if not goto 1" loop isn't inherent to coding lol
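For readers who have not seen the "loop with compilers / linters / tests" idea spelled out, here is a minimal sketch of it in Python. Everything in it is an assumption made for illustration: `fake_model` is a hypothetical stand-in for a real model call (no actual LLM API is used), `repair_loop` and `syntax_check` are made-up names, and the only "compiler" involved is Python's built-in `compile()`. The point is just the shape of the loop: generate, check, feed the error back, repeat.

```python
from typing import Callable, Optional, Tuple


def syntax_check(source: str) -> Optional[str]:
    """Return an error message if `source` is not valid Python, else None."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"SyntaxError on line {exc.lineno}: {exc.msg}"
    return None


def repair_loop(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], int]:
    """Call `generate` until `check` passes or the round budget runs out.

    Returns (accepted_source_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_number in range(1, max_rounds + 1):
        candidate = generate(feedback)  # previous error (if any) goes back in
        feedback = check(candidate)
        if feedback is None:
            return candidate, round_number
    return None, max_rounds


def fake_model(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong at first, fixed once told why."""
    if feedback is None:
        return "def add(a, b)\n    return a + b\n"  # missing colon
    return "def add(a, b):\n    return a + b\n"


source, rounds = repair_loop(fake_model, syntax_check)
print(rounds)
```

A real setup would swap `syntax_check` for a linter or test runner and `fake_model` for an actual model call; the sketch says nothing about how often a real model converges, only that the checker, not the user's mood, decides when the loop stops.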

pydry 171 days ago

I don't think the "meme" that LLMs follow instructions inconsistently will ever die because they do. It's in the nature of how LLMs function under the hood.

> Also who sits around rerunning the same prompt over and over again to see if you get a different outcome like it's a slot machine?

Nobody. Plenty of people do like to tell the LLM that somebody might die if they don't do X properly, and other such faith-based interventions with their "magic box", though.

Boy do their eyes light up when they hit the "jackpot", too (LLM writes what appears to be the correct code on the first shot).

simonw 171 days ago

They're so much more consistent now than they used to be. The new LLMs almost always boast about how much better they are at "instruction following" and it really shows: I find Claude 4.5 and GPT-5.x models do exactly what I tell them to most of the time.

Snuggly73 171 days ago

I am going to preface this by saying that I could be completely wrong.

Simon - you are an outlier in the sense that basically your job is to play with LLMs. You don't have stakeholders with requirements that they themselves don't understand, you don't have to go to meetings, deal with a team, shout at people, do PRs etc., etc. The whole SDLC/process of SWE is compressed for you.

simonw 171 days ago

That's mostly (though not 100%) true, and a fair comment to make here.

Something that's a little relevant to how I work here is that I deliberately use big-team software engineering methods - issue trackers, automated tests, CI, PR code reviews, comprehensive documentation, well-tuned development environments - for all of my personal projects, because I find they help me move faster: https://simonwillison.net/2022/Nov/26/productivity/

But yes, it's entirely fair to point out that my use of LLMs is quite detached from how they might be used on large team commercial projects.

lelanthran 171 days ago

I think this shows where the real value of AI coding is: brand new repos, on tiny throwaway projects.

I'm not going to browse every commit in that repo, but half of the projects were created in December. The rest are either a few months old or less than a year old.

This is not representative of the industry.

fancyfredbot 171 days ago

That's certainly an impressive month! However, it's conceivable that you are an outlier (in the best possible way!)

I liked the way they did that study and I would be interested to see an updated version with new tools.

I'm not particularly sceptical myself and my guess is that using Opus 4.5 would probably have produced a different result to the one in the original study.

simonw 171 days ago

I'm definitely an outlier - I've been pushing the boundaries of these tools for three years now and this month I've been deliberately throwing some absurdly ambitious problems at Opus 4.5 (like this one: https://static.simonwillison.net/static/2025/claude-code-mic...) to see how far it can go.

fancyfredbot 171 days ago

Very interesting example. It's an insanely complex task even with a reference implementation in another language.

It's surprising that it manages the majority of the test cases but not all of them. That's not a very human-like result. I would expect humans to be bimodal with some people getting stuck earlier and the rest completing everything. Fractal intelligence strikes again I guess?

Do you think the way you specified the task at such a high level made it easier for Claude? I would have probably tried to be much more specific for example by translating on a file by file or function by function basis. But I've no idea if this is a good approach. I'm really tempted to try this now! Very inspiring.

simonw 171 days ago

> Do you think the way you specified the task at such a high level made it easier for Claude?

Absolutely. The trick I've found works best for these longer tasks is to give it an existing test suite and a goal to get those tests to pass, see also: https://simonwillison.net/2025/Dec/15/porting-justhtml/

In this case ripping off the MicroQuickJS test suite was the big unlock.

I have a WebAssembly runtime demo I need to publish where I used the WebAssembly specification itself, which it turns out has a comprehensive test suite built in as well.
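As a rough illustration of the "existing test suite as the goal" trick, here is a small Python sketch. It is not taken from any of the linked projects: the reference cases, `run_suite`, `naive_port`, and `fixed_port` are all hypothetical, and the toy task (porting C-style truncating integer division to Python) is just small enough to fit in a comment. The idea is that the reference cases define "done", and the list of still-failing names is what gets handed back to the agent.

```python
from typing import Any, Callable, List, Tuple

# Hypothetical reference cases: (name, arguments, expected result).
# Expected values follow C integer division, which truncates toward zero.
Case = Tuple[str, Tuple[Any, ...], Any]

REFERENCE_CASES: List[Case] = [
    ("positive", (7, 2), 3),
    ("exact", (8, 2), 4),
    ("negative", (-7, 2), -3),
]


def run_suite(
    candidate: Callable[..., Any], cases: List[Case]
) -> Tuple[List[str], List[str]]:
    """Return (passed_names, failed_names) for `candidate` over `cases`."""
    passed: List[str] = []
    failed: List[str] = []
    for name, args, expected in cases:
        try:
            ok = candidate(*args) == expected
        except Exception:
            ok = False  # a crash counts as a failing case
        (passed if ok else failed).append(name)
    return passed, failed


def naive_port(a: int, b: int) -> int:
    """First attempt: Python's // floors, so negative inputs come out wrong."""
    return a // b


def fixed_port(a: int, b: int) -> int:
    """Second attempt: int() truncates toward zero (fine for small inputs)."""
    return int(a / b)


passed, failed = run_suite(naive_port, REFERENCE_CASES)
print(f"{len(passed)}/{len(REFERENCE_CASES)} passing, still failing: {failed}")
```

The naive port passes most cases but not all, which is the kind of partial result discussed upthread; the failing names give a concrete next instruction instead of a vague "make it work".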

kwertyoowiyop 171 days ago

In the 80s, when the mouse was just becoming common, there was a study comparing programming using a mouse vs. just a keyboard. Programmers thought they were faster using a keyboard, but they were actually faster using a mouse.

logicprog 171 days ago

That's the Ask Tog "study"[1]. It wasn't programmers, just regular users. The problem is he just says it, and of course Apple at the time of the Macintosh's development would have a strong motivation to prove mousing superior to keyboarding to skeptical users. Additionally, the experience level of the users was never specified.

[1]: https://www.asktog.com/TOI/toi06KeyboardVMouse1.html

harvey9 171 days ago

This surprises me because at the time user interfaces were optimised for the keyboard, the only input device most people had. Also, screen resolutions were lower, so there were fewer things you could click on anyway.

mossTechnician 171 days ago

METR has some substantial AI industry ties, so I wonder if those clarifications (especially the one pointing at their own studies describing AI progress) are a way to mitigate concerns that industry would have with the apparent results of this study.