

by agentultra 259 days ago

I don’t think people are good at self-reporting the “boost” it gives them.

We need more empirical evidence. And historically we’re really bad at running such studies and they’re usually incredibly expensive. And the people with the money aren’t interested in engineering. They generally have other motives for allowing FUD and hype about productivity to spread.

Personally I don’t see these tools going much further than where they are now. They choke on anything that isn’t a greenfield project and consistently produce unwanted results. I don’t know what magic incantations and combinations of agents people have got set up, but if that’s what they call “engineering” these days, I’m not sure that word has any meaning anymore.

Maybe these tools will get there one day but don’t go holding your breath.


simonw 259 days ago

> They choke on anything that isn’t a greenfield project and consistently produce unwanted results.

That was true 8 months ago. It's not true today, because of the one-two punch of modern longer-context "reasoning" models (Claude 4+, GPT-5+) and terminal-based coding agents (Claude Code, Codex CLI).

Setting those loose on an existing large project is a very different experience from previous LLM tools.

I've watched Claude Code use grep to find potential candidates for a change I want to make, then read the related code, follow back the chain of function calls, track down the relevant tests, make a quick detour to fetch the source code of a dependency directly from GitHub (by guessing the URL to the raw file) in order to confirm a detail, make the change, test the change with an ad-hoc "python -c ..." script, add a new automated test, run the tests and declare victory.

That's a different class entirely from what GPT-4o was able to do.
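To make the first step of that workflow concrete, here is a minimal sketch of a grep-style search for candidate lines in a codebase. It is only an illustration of the idea, not how Claude Code or any other agent is implemented; the function name `find_candidates` and its parameters are hypothetical.

```python
import re
from pathlib import Path


def find_candidates(root, pattern, suffixes=(".py",)):
    """Return (path, line_number, line) for each line under root matching pattern.

    Hypothetical helper: a plain-Python stand-in for running grep over a repo.
    """
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Only look at regular files with one of the requested suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_candidates(".", r"old_api")` would list every Python line mentioning `old_api`, which is roughly the starting point for the read, trace, and test steps described above.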


XenophileJKO 259 days ago

I think the thing people have to understand is how fast the value proposition is changing. There is a lot of conversation about "plateauing" model performance, but the actual experience from the combination of the model and tooling changes is night and day in the last 3 months. It was beginning to be very useful with Claude 3.7 in the spring this year, but we have just gone through a step function change.

I was decommissioning some code and I made the mistake of asking for an "exhaustive" analysis of the areas I needed to remove. Sonnet 4.5 took 30 minutes looking around and compiling a detailed report on exactly what needed to be removed from this very very brownfield project, and after I reviewed the report, it one-shot the decommissioning of the code (in this case I was using Claude in the Cursor tooling at work). It was overkill, but impressive how well it mapped all the ramifications in the code base by grepping around.


manmal 258 days ago

Indeed, Codex CLI is quite useful even for demanding tasks. The current problem is that it might gather context for 20 minutes before doing the actual thing. The question is whether this will be sped up significantly.


what 259 days ago

I guess we just have to take your word for this, which is somewhat odd considering most of your comments link back to some artifact of yours. Are you paid by any of these companies?


simonw 259 days ago

I'm not paid by any of them, but I occasionally get preview access to models or invites to events. I attended OpenAI's DevDay on Monday for free, for example.

I have a disclosures section on my blog here: https://simonwillison.net/about/#disclosures


csar 258 days ago

OP is one of the co-creators of Django (for which I am eternally grateful, having built my first company on top of it) and one of the most prolific writers in the space. I also happen to strongly agree with his assessment, though, as he said, getting that amount of value out of current tools is real work.


danielbln 258 days ago

It is real work, and it requires solid priors to do it. The cynical people punch three prompts in, are disappointed that it doesn't work in their codebase they've worked in for 2 decades and complain that everyone is a shill and that people should stop saying they "hold it wrong".

The skill ceiling is high, it turns out. It's just deceptive, because it's so easy to get going. Ultra accessible foot gun, lots of work to point it in the right direction reliably and repeatedly. Significant benefits if you manage though.

I've gotten more relaxed about it now though. People will either get it or they don't.


mohsen1 258 days ago

https://github.com/bodo-run/yek/pull/213

Here is an example of mostly automated work. It's a small feature, but it was done perfectly.


IanCal 258 days ago

That the tools do this kind of thing? They do, they’ll go through pretty long multi step processes to find things and edit them. They run tests, check output, see it’s wrong and go and add debug statements, rerun, try and fix things, rerun, then remove the logging.
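The run, check, fix, rerun cycle described above can be sketched as a small loop. This is an assumption-laden illustration rather than any agent's real implementation: `fix_until_green`, `run_tests`, and `attempt_fix` are hypothetical names, and the two callables stand in for whatever actually runs the test suite and edits the code.

```python
def fix_until_green(run_tests, attempt_fix, max_rounds=5):
    """Alternate between running tests and attempting a fix.

    run_tests() must return (passed, output).
    attempt_fix(output) is expected to change the code based on the failure.
    Returns the number of fix attempts made before the tests passed.
    Raises RuntimeError if the tests still fail after max_rounds attempts.
    """
    for attempts in range(max_rounds + 1):
        passed, output = run_tests()
        if passed:
            return attempts
        if attempts < max_rounds:
            # Feed the failing output back in, e.g. to add logging or patch code.
            attempt_fix(output)
    raise RuntimeError(f"tests still failing after {max_rounds} fix attempts")
```

The cap on rounds is a design choice in this sketch: without it, a fix step that never converges would loop forever.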
