by thedjpetersen 27 days ago
Part of my job is trying to make these models productive for the large corporation I work for. It's a lot of throwing tomatoes at a wall, and to a degree I see the issue he is talking about: the output seemingly has a certain ceiling.

At the same time, nowhere in his post is there a code snippet or anything to latch on to of the form "the model performed poorly here when it should have done this". This style of criticism seems to be a pattern in most of these "the LLMs will never work" posts on blogs and Twitter.

They obviously can perform better than autocomplete, and in my own day-to-day development they build out huge portions of a codebase at the level I would have expected from a junior or mid-level engineer.

How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making?

4 comments

> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.

The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming>, not quite my tempo, <excellent drumming>, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …

Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.

Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point to exactly where it fails.

Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and in the end, the result of that iteration on the AI output)

I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (e.g., in Rust I no longer preoccupy myself with precisely scoping pub, or whatever, unless I'm making a library crate). I send a "changes requested" prompt+JSON to my agent, and it interactively fixes everything (even style, even comments with manual patches made in my in-review-tool editor).

Well again that is just a "vibes" explanation with nothing concrete.

I feel like with LLMs, it's like a situation where you are close to some feature or project and already have a pretty good idea in your head of how you'd implement it yourself ("I'd do this and have an API with that and a database table foo for storing bar with an index on baz"), and you're keen to get started on it... but then someone else gets assigned to work on it, not you.

They do it a totally different way than you would have thought of doing it, and the code feels alien and weird because it doesn't follow your "design" and decisions you already had in your head before they started work on it. Is it "bad" or just not how you'd have done it?

I think that is ok. So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go. Sure maybe you'd have done it a different way but ultimately that doesn't matter.

> So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go

That is the problem. The code often is full of hacks and bad abstractions. LLMs write code like a junior or mid-level engineer – perfectly overfitted to today’s request. Oh you need to work on this code tomorrow and there’s a laundry list of future requirements? Throw away and rewrite, I guess.

You can most easily see this when you ask LLMs to write tests. They have a tendency to write convoluted tests that absolutely definitely pass. Even when you know the code has a bug, they’ll write the test in a way that fits the code as written and passes. Because they know tests should pass.

Getting an LLM to write a failing test against a currently working function because you know the business requirements have changed is like pulling teeth.
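
For anyone who wants something concrete to latch on to, here is a minimal sketch of what "a failing test against a currently working function" can look like. Everything in it is hypothetical: `shipping_cost`, the thresholds, and the "new requirement" are invented for illustration and are not from any real codebase. One check is derived from reading the code, the other from the (assumed) new requirement.

```python
# Hypothetical example: all names and numbers are invented for illustration.

def shipping_cost(order_total: float) -> float:
    """Current, working behavior: free shipping for orders of 50 or more."""
    return 0.0 if order_total >= 50 else 5.0


def check_fitted_to_code(fn) -> bool:
    """A check derived from reading the code as written."""
    return fn(50) == 0.0 and fn(49.99) == 5.0


def check_new_requirement(fn) -> bool:
    """A check derived from the assumed new requirement:
    free shipping now starts at 100, not 50."""
    return fn(99.99) == 5.0 and fn(100) == 0.0


if __name__ == "__main__":
    # The fitted check passes; the requirement-driven check fails.
    # That failure is the useful signal that the code has to change.
    print("fitted to code: ", check_fitted_to_code(shipping_cost))
    print("new requirement:", check_new_requirement(shipping_cost))
```

The second check is supposed to be red until the function is updated; in this sketch, a test suite that only contains the first kind of check stays green through the requirement change.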

You don’t see writing about this stuff because it doesn’t neatly fit in an article or video (I’ve tried). Plus it goes against the zeitgeist so you’d never get traction (even if people write these posts, we don’t see them)

The unit test example has been my team's experience as well. The unit tests look good on the surface, but whether they pass or fail has little predictive value for whether there are actually bugs in the code.
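
One way to picture "little predictive value" is a test that passes against both a correct implementation and a deliberately broken one. The sketch below is an invented illustration, not code from any team: `apply_discount`, its broken variant, and both tests are hypothetical names.

```python
# Hypothetical example: function names and tests are invented for illustration.

def apply_discount(price: float, percent: float) -> float:
    """Intended behavior: reduce price by the given percentage."""
    return price * (1 - percent / 100)


def apply_discount_broken(price: float, percent: float) -> float:
    """Deliberately broken variant: ignores the percentage entirely."""
    return price


def weak_test(fn) -> bool:
    """Looks reasonable, but only exercises a 0% discount and a type check."""
    result = fn(200.0, 0)
    return isinstance(result, float) and result == 200.0


def stronger_test(fn) -> bool:
    """Pins down a case where the discount actually matters."""
    return abs(fn(200.0, 25) - 150.0) < 1e-9


if __name__ == "__main__":
    for fn in (apply_discount, apply_discount_broken):
        print(fn.__name__, "weak:", weak_test(fn), "stronger:", stronger_test(fn))
```

The weak test cannot tell the two implementations apart, so its result says nothing about this particular bug; the stronger test can.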

Some people have suggested you write the unit tests by hand to basically "check" the LLM's work and keep it honest. But to write good unit tests you have to understand the underlying code, which takes time (since you didn't write it). To me this is another bullet point that suggests LLMs will eventually be relegated to "StackOverflow+" duty: give me snippets, but I'll still write effectively all the code.

Last week I helped a co-worker with some flaky tests, where the code and tests were generated by one of the models. While looking at how one of the tests worked, I spotted a place in the code where a boolean condition was backwards in a way a human would never have written it (and on top of it, there was a confidently incorrect block comment above it, so it was easy to assume it was correct). So even if he'd fixed the test's flakiness, it would have ended up always failing instead of sometimes failing. He'd spent hours trying to figure out what was going on.
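
As a rough illustration of that shape of bug (this is not the actual code from the anecdote; the function names and the retry scenario are assumptions made up for the example), a correct-sounding comment sitting on top of an inverted condition might look like this:

```python
# Hypothetical reconstruction: same shape of bug, invented names and scenario.

def can_retry_as_written(attempts: int, max_attempts: int) -> bool:
    # Retry only while we still have attempts left.
    # (The comment above reads as correct, but the condition below is inverted.)
    return attempts >= max_attempts


def can_retry_as_intended(attempts: int, max_attempts: int) -> bool:
    # Retry only while we still have attempts left.
    return attempts < max_attempts


if __name__ == "__main__":
    for attempts in (0, 3):
        print(
            attempts,
            "written:", can_retry_as_written(attempts, 3),
            "intended:", can_retry_as_intended(attempts, 3),
        )
```

Because the comment matches the intent rather than the code, a reviewer skimming it has little reason to re-read the condition.
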

When people write blog posts about how LLMs failed for some particular task, the responses from boosters invariably fall along the lines of "just use this other model/just tweak your prompt like so/you're just not skilled enough—you can't make fundamental arguments about AI by citing specific examples."

So we can't make arguments by citing specific examples, and also can't make arguments by not citing specific examples. Whelp, I guess that's the ball game.

(yes yes, I'm committing a group attribution error, but still)

I think we should investigate the backgrounds of those making claims one way or another and rely on those backgrounds for determining credibility. I suspect that we'd find that those who are saying LLMs write great, bulletproof code with "100% unit test coverage" (true story: a coworker was bragging about 100% unit test coverage) are not really qualified to be software engineers. This is a trend I have noticed in my org. Those drinking the most LLM Kool-Aid do NOT have an engineering/comp sci degree, have relatively little experience, and have incredibly weak resumes (e.g., generic stuff that we've all done as software engineers).

We no longer have the luxury of welcoming bootcamp engineers into our field with open arms. We need to protect our craft. Call these fools out or they'll keep spreading hype/FOMO.

This is an excellent point, and as a novice using LLMs for projects I could never previously dream of doing, I find myself looking for the same: examples or citations of exactly what agents are writing incorrectly and how a human would do it better. I'm sure they're out there; maybe someone can point to some good content showing such examples.

I have no doubt the top nth percent of coders could write circles around Claude or Codex, but how much worse are they than your average schnook?

Reality: the top nth percent of coders are seeing absurd, dramatic gains in productivity using LLMs. See: antirez, Simon Willison, Steve Yegge.

The more experience you bring to the table, the more value you get from these tools.

Look, about 12 years ago, articles about how if you're not pair programming you're doing it wrong were on HN's home page every day. Doing well-prompted plan -> agent -> debug cycles is like pair programming with someone who knows every SDK and API intuitively and doesn't have to pick up their kids from daycare at 4pm.

antirez is famous for creating Redis, which took a dump in quality and everyone switched to a fork called Valkey.

Rubbish. The license change was the reason for the fork of the community, and for people switching. Quality was never cited as the issue.

and Steve Yegge is currently just burning mountains of money with Gas Town or whatever came after that

While I don't actually disagree - to me, Gas Town sounds literally insane - I suspect that if you reframe his work to compare it against the cost of developing a new medication or chip fabrication technique, you can make a strong argument that he's putting his money where his mouth is to see how far he can take a new technology. He's doing science! And I think that's admirable, even if nothing comes of it.

When I think of how much money gets wasted on gambling apps and how much human potential gets wasted watching reality television and compare that to Steve going full Alexander Shulgin with LLMs, the comparison really falls flat.

The problem is what they do to large existing systems: subtle misunderstandings mean subtle bugs are constantly being introduced, and very few shops have adequate systems in place to receive reports of subtle issues at the rates they occurred 10 years ago, let alone today. And don't even get me started on llm-assisted support that some might suggest as a solution.

This article goes into quite a lot of detailed examples that include code snippets that demonstrate poor architecture: https://blog.k10s.dev/im-going-back-to-writing-code-by-hand/