Hacker News
by Swizec 25 days ago
> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.

The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming>, not quite my tempo, <excellent drumming>, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …

Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.


Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point to exactly where it fails.

Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and, in the end, to the result of iterating on the AI output).

I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (e.g. in Rust I don't preoccupy myself anymore with precisely scoping pub, or whatever, unless I'm making a library crate). I send a "changes requested" prompt+JSON to my agent, and it interactively fixes everything (even style, even comments with manual patches from my in-review-tool editor).

Well again that is just a "vibes" explanation with nothing concrete.

I feel like with LLMs it's like being close to some feature or project, with a pretty good idea in your head already of how you'd implement it yourself ("I'd do this and have an API with that and a database table foo for storing bar with index on baz"), and you're keen to get started on it... but then someone else gets assigned to work on it, not you.

They do it a totally different way than you would have thought of doing it, and the code feels alien and weird because it doesn't follow your "design" and decisions you already had in your head before they started work on it. Is it "bad" or just not how you'd have done it?

I think that is ok. So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go. Sure maybe you'd have done it a different way but ultimately that doesn't matter.

> So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go

That is the problem. The code often is full of hacks and bad abstractions. LLMs write code like a junior or mid-level engineer – perfectly overfitted to today’s request. Oh you need to work on this code tomorrow and there’s a laundry list of future requirements? Throw away and rewrite, I guess.

You can most easily see this when you ask LLMs to write tests. They have a tendency to write convoluted tests that absolutely definitely pass. Even when you know the code has a bug, they’ll write the test in a way that fits the code as written and passes. Because they know tests should pass.

Getting an LLM to write a failing test against a currently working function because you know the business requirements have changed is like pulling teeth.
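To make that concrete, here is a minimal sketch of the pattern. Everything in it is made up for illustration: the function, the requirement ("orders of 100.00 or more get 10% off") and the amounts are assumptions, not code from any real project. The first check has its expected values read off the implementation, so it passes with or without the bug. The second takes its expected value from the requirement and fails against the code as written, which is the kind of test that is hard to get.

```python
# Hypothetical function with a bug. Assumed requirement: orders of 100.00
# or more get a 10% discount. Amounts are integer cents to avoid float noise.
def discounted_total(total_cents: int) -> int:
    if total_cents > 10_000:  # bug under the assumed requirement: should be >=
        return total_cents * 9 // 10
    return total_cents


def check_fitted_to_code() -> bool:
    # Expected values read off the implementation. The boundary is skipped,
    # so this returns True whether or not the bug is present.
    return discounted_total(20_000) == 18_000 and discounted_total(5_000) == 5_000


def check_from_requirement() -> bool:
    # Expected value taken from the requirement, exactly at the boundary.
    # Returns False against the buggy code above, which is the point.
    return discounted_total(10_000) == 9_000
```

The second check is the useful one precisely because it currently fails.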

You don’t see writing about this stuff because it doesn’t neatly fit in an article or video (I’ve tried). Plus it goes against the zeitgeist so you’d never get traction (even if people write these posts, we don’t see them).

The unit test example has been my team's experience as well. The unit tests look good on the surface, but their passing or failing has little predictive value for whether there are actually bugs in the code.

Some people have suggested writing the unit tests by hand to "check" the LLM's work and keep it honest. But to write good unit tests you have to understand the underlying code, which takes time (since you didn't write it). To me this is another bullet point suggesting LLMs will eventually be relegated to "StackOverflow+" duty - give me snippets, but I'll still write effectively all the code.

Last week I helped a co-worker with some flaky tests, where the code and tests were generated by one of the models. While looking at how one of the tests works, I spotted a place in the code where a boolean condition was backwards in a way a human would never have written (and on top of it, there was a confidently-incorrect block comment above it, so it was easy to assume it was correct) - so even if he'd fixed the test's flakiness, it would end up always failing instead of sometimes failing. He'd spent hours trying to figure out what was going on.
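I can't share the real code, so this is only a rough sketch of the shape of it. The helper names, the expiry scenario and the timestamps are all invented for illustration. The comment reads as correct, the condition under it is inverted, and a check against a fixed clock (rather than the real one) fails every time instead of some of the time.

```python
from datetime import datetime, timedelta


def is_expired_as_written(expires_at: datetime, now: datetime) -> bool:
    # Returns True once the expiry time has passed.  <- confident, and wrong
    return expires_at > now  # backwards: True while the item is still valid


def is_expired_fixed(expires_at: datetime, now: datetime) -> bool:
    # Returns True once the expiry time has passed.
    return now >= expires_at


# A fixed clock (hypothetical values) keeps the check deterministic.
NOW = datetime(2024, 1, 1, 12, 0, 0)
PAST = NOW - timedelta(hours=1)
FUTURE = NOW + timedelta(hours=1)
```

With these values, `is_expired_as_written(PAST, NOW)` is `False` and `is_expired_fixed(PAST, NOW)` is `True`.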