Hacker News
by gymbeaux 1 day ago
Those numbers are abysmal. Should we really be using LLMs to write our code? I have a theory- LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely with LLM-happy devs might end up virtually unmaintainable.

I’m not sure how to explain it, but the more I see LLM-written code the more I feel it’s bad code doing a good job of masquerading as good code. I think this take will become less-hot in the next year or two when we see enterprise greenfield projects that were created entirely with LLM “assistance” go to prod. I think we’ll find that the code is difficult for humans to read, understand, debug, and extend- and I think the larger the codebase the harder it will be for LLMs to maintain. More opportunity for hallucination, larger context windows needed, more tokens bought and spent for smaller and smaller code changes. I think the more code an LLM writes for an app, the worse that codebase becomes.

8 comments

I can't help but feel that people continually underestimate how bad human written code becomes over time. The exception is probably single-person passion projects or open source projects that maintain quality governance over time.

I strongly suspect most closed source code developed under commercial or internal pressure is pretty awful after a few years of development.

All LLM code has to do is suck less than existing code. And that's presuming the code quality doesn't improve as the models, the harnesses and our ways of working with them improve.

Sucky human-written code is still based on human understanding, which can change over time, be readjusted or solidified. People implement something wrong once, then update their perspective, then in the future do it right.

LLMs don't have this benefit. You forget to add the correction to the system prompt, and the LLM will repeat the same mistake over and over, and worse than that, their mistakes aren't based on their understanding, it's basically random guesses.

Humans, even bad coders, still seem to have some sort of architecture in mind, even if it's spaghetti, whereas LLMs (obviously) don't think more than a few steps, and never about the full scope of what they're contributing to, and on purpose too, because you want the context to be as small as possible when you work with LLMs.

With LLMs you need to tread carefully between "What does the LLM need to know?" and "Can I skip passing this to the LLM this time?" while a human you can more or less dump them everything you sit on, and let them shift it through, and they'll mostly make it out OK.

> their mistakes aren't based on their understanding, it's basically random guesses

Whilst I don't claim any true "understanding", as that is a very loaded term, that doesn't mean it's just random guesses.

Anyone using recent LLM coding agents on a regular basis would probably agree that there's something going on that fits some non-anthropomorphizing, non-sentience-assigning definition of "understanding".

As for the point about improvement - I think that's an orthogonal issue to the overall code quality. With regard to human codebases - there are plenty of scenarios that negate the improvement of individuals. We're comparing organizations with LLMs, not individuals with LLMs, and that makes a significant difference.

>their mistakes aren't based on their understanding, it's basically random guesses.

Not random across their whole training set. Random across related concepts bundled together in the training set. Which is not that dissimilar to human mistakes.

A human's mistakes are also based on going from one option in their training and not another, where the two are close together but one is not appropriate and doesn't fully cover the expected result.

That's obvious in a typo (you get close to the target word but miss it just so), but also in off-by-one errors (you're still in the proximity of the correct loop you should have written), all the way to picking the wrong architecture or pattern (you still chose among patterns for the worse fit you've picked, you don't suddenly start using cooking recipes).

> while a human you can more or less dump them everything you sit on, and let them shift it through, and they'll mostly make it out OK

i dont see why software engineers are paid so well, and are so hard to hire?

just dump a bunch of requirements on a homeless person and itll just work out

I have no idea what point you're making here.
He's making the point that the idea that "with a human you can more or less dump them everything you sit on, and let them shift it through, and they'll mostly make it out OK" is wrong.

Humans absolutely need lots of training to be able to get this dump and do it right, and even when trained, different humans, or the same humans at different coding tasks, can be far worse or far better at it.

I think the real issue might be that how “good” the code is matters less than being able to form a mental model for what the human who wrote the code was “thinking”. If written by a machine, this contract is broken and we get more confused, even if our traditional methods of evaluating the code come out equal.
Yes, thank you for wording it better. When I read through for example an entire codebase that was ~99% written by AI, it's "inconsistent" in a way that even a shared-by-humans codebase would not be. I think this arises from the AI misunderstanding slightly what is being asked - the AI misunderstands, but can still (at least in some cases) output code that does what it needs to. It may also do other things that it doesn't need to do, or may do the thing in a suboptimal, not-so-maintainable way, but the UI works and that's enough for most non-technical people.
And where do you think the LLM learned coding from?

But anyway, let the LLM verify the code to give advice on improvements but don't let it write code unverified. That's my opinion on it anyway.

If I have to verify the code then I don't see a point in using it to write more than a single method at a time, and that method should be simple enough that I can take a very quick glance and be able to tell it's correct - something like a method that writes an array to a CSV. I don't have that code memorized, especially in the various languages I regularly work in, but I know it when I see it. Anything more complex than that and I think it would take me as much or more time to truly verify the AI's output than to just write it myself.
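For reference, the kind of quick-to-verify method meant here might look like the following minimal Python sketch. The function name `write_rows_to_csv` is hypothetical, picked only for illustration, and the sketch assumes the rows are already in memory as sequences of values. It relies on the standard library `csv` module.

```python
import csv
from typing import Iterable, Sequence


def write_rows_to_csv(path: str, rows: Iterable[Sequence[object]]) -> None:
    """Write each row (a sequence of values) as one line of a CSV file."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # The writer quotes fields containing commas or quotes as needed.
        csv.writer(handle).writerows(rows)
```

A call such as `write_rows_to_csv("out.csv", [["name", "qty"], ["widget", 2]])` is about the size of thing that can be checked at a glance.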
That doesn’t help the developers who have high standards.
Yes. But that's not the point I'm addressing.
I've been sent code from vendors that didn't even compile, long before llms were a thing. Most shops that aren't primarily software have really really terrible software.
True. I used to get code that wouldn't compile all the time from Infosys "developers" in India circa 2016. Perhaps now with LLMs they still do basically no work, but at least the code compiles? That being said, I'm not sure that paying for Opus/GPT/Gemini makes sense for a company like Infosys that caters to the dumb C-Levels of large corporations who think "why pay X for U.S. devs when we can pay X/5 for foreign devs?" - such companies are fucked in the long-term anyway. Why would Infosys voluntarily pay gobs of money when such U.S. corporations seem to be content with their output as-is?
Not my observation. If you never look at the code and don't have basic guardrails in place (linters, architecture tests, some guidelines for best practices) - probably.

But as soon as you do minimal reviews and high-level corrections, applications turn out just fine.

Can there be bugs? Sure. That's the price of not reading or understanding every line. It should depend on the criticality of your software how much of these you tolerate and how much you don't (reviewing, understanding, testing everything 100% like you were used to if you had written it yourself will kill most if not all of your gained speed)

But I never got the impression of unmaintainability or unfixable bugs.

Actually the other way around: A really good cleanup pass, architectural changes, or bugfixes are seldom more than a few prompts and 2 hours away, provided your overall base is decent and you actually gave a fuck from the start.

> Can there be bugs? Sure. That's the price of not reading or understanding every line.

I've yet to come across a human developer whose output would meet this standard, despite writing every line.

In fact, having an LLM review our code is catching quite a few bugs before it reaches QA.

Indeed, though I find the distribution is different.

The humans may skip unit tests and need reminding; the AI always writes unit tests once it's in AGENTS.md or whatever, but my experience* was that 5-10% of the time the LLM's attempt at a "test" would, instead of executing the code and examining the results, open the source code as a text file and run a regex to find/exclude certain substrings.

* At the start of this year, because Anthropic and OpenAI were both offering free trials. IDK how much things have changed since then, some things change fast in this domain, other things don't.
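To make the difference concrete, here is a rough Python sketch of the two kinds of "test" described above. Everything in it is hypothetical: `apply_discount`, the deliberately buggy source string, and both checker functions are invented for illustration and are not taken from any real agent output. The source is held in a string only so the example is self-contained.

```python
import re

# Hypothetical module source with a deliberate bug in it.
BUGGY_SOURCE = '''
def apply_discount(price, percent):
    return price * (1 - percent / 10)   # bug: should divide by 100
'''


def source_grep_test(source: str) -> bool:
    """Anti-pattern: only checks that certain text appears in the source."""
    has_function = re.search(r"def apply_discount\(", source) is not None
    return has_function and "percent" in source


def behavioural_test(source: str) -> bool:
    """Executes the code and examines the result it produces."""
    namespace: dict = {}
    exec(source, namespace)  # load the function defined in the source string
    return namespace["apply_discount"](200.0, 25.0) == 150.0
```

On the buggy source the regex-style check passes while the behavioural check fails, which is why the first kind of test gives false confidence.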

I’ve been piloting LLMs for the past six months non stop and we’re at the point where formally verified models generated as an intermediate step between spec and code are very good value.

Riding the exponential means you have to update priors more often.

I have seen some pre-AI over-mocked codebases where the "tests" were essentially that (but harder to read than regex would have been).
What I'm hearing is "thoroughly reviewing AI generated code would defeat the purpose, so we give it a cursory glance and it seems to be decent code", and that's my point - it does indeed seem to be decent code but I think we're all kicking the can down the road when we operate this way. If the alternative means there's no gain to be had by using LLMs to write code, so be it. Maybe that's the answer. Maybe we shouldn't be relying so much on AI to write our code.

I think LLMs are great for writing small snippets of code that really only have one "best answer" (something simple like writing an array to a CSV), and internal tools, where bugs and security vulnerabilities usually aren't a big deal.

Take a look at a sufficiently old random internal repo which was not written with LLMs and compare.

My observation is that they are equally bad and hard to maintain or even more so than the new ones.

One thing I’ve noticed is that the LLM assisted ones have a lot more comments which is nice but take more time to read.

I recognize that I seem to be an above-average developer and so my standards for LLM output are higher. I'm seeing enough comments where people are comparing LLM output to offshore dev output or even onshore mid-level or junior-level output. I guess my response would be something like "why are we celebrating this as a huge success for mankind?" If we dump obscene amounts of electricity and silicon into datacenters, we can produce mediocre or "adequate" software engineers?

I agree that LLM output tends to be absolutely flooded with comments. I think it helps to quickly see "is the output what I expected, does it do everything it should be doing in this method?" but I also think most people don't bother reading the comments.

Yes, LLMs generate technical debt.

And they do it faster than any human developer.

>I have a theory- LLMs can spit out code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time. An enterprise app developed entirely with LLM-happy devs might end up virtually unmaintainable.

For most enterprise apps, being "unmaintainable" would be an improvement.

I have a theory that LLM generated code in a highly modular style (simple data, pure functions) will be easier to “recover” by a human team when the LLM gets muddled. So Haskell, basically.
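As a rough illustration of what "simple data, pure functions" could look like outside Haskell, here is a small Python sketch. The `Invoice` type and both functions are hypothetical names made up for this example; the only assumption is that amounts are kept as integer cents.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Invoice:
    """Plain immutable data: no behaviour, no hidden state."""
    subtotal_cents: int
    tax_rate_percent: int


def total_cents(invoice: Invoice) -> int:
    """Pure function: the result depends only on the argument."""
    tax = invoice.subtotal_cents * invoice.tax_rate_percent // 100
    return invoice.subtotal_cents + tax


def with_tax_rate(invoice: Invoice, tax_rate_percent: int) -> Invoice:
    """Return a new Invoice instead of mutating the one passed in."""
    return replace(invoice, tax_rate_percent=tax_rate_percent)
```

Because nothing is mutated and each function can be read and tested in isolation, a human team could in principle replace one muddled function without tracing shared state through the rest of the codebase.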
Easy fix: Code's basically free now, so just pipe your errors straight into an LLM and get instant patches. Sure, the patches themselves are broken too, but no worries! just pipe those back in again. Code's disposable now, fresh code generated on every request.

On a more serious note, I think the problem will be the inability to handle/maintain the systems once they are too big and nobody has any idea what's inside of them or what they do.

Yeah, it’s so easy to generate code that you can do a whole codebase rewrite in a day.

Is this a good idea? Probably not—in the past we would only do that when the architecture was causing serious problems since it always has tons of behaviors that will accidentally not get carried forward, some of which are load bearing and will cause bugs.

Now we can do it in an afternoon and get the same long term bug behavior.

> code that gets the job done and looks ok, maybe even great, but contains small “anomalies” that compound over time

They clearly are only assistants for the moment, you can use them to do work ... but only if you could do the said work yourself alone in the first place.

I would say "only if you can review said work yourself alone", rather than "do".

I'm an experienced developer, but I don't count myself as a web dev or a python dev; I can review the web and python stuff I get out of the AI (sometimes I need to ask the AI follow-up questions so I can find official documentation for what it did), but I can't write it.

I think you could eventually do it then, it would just take you longer.
If "eventually" counts, I can say I have "run" a marathon (I have walked that distance in one session, or if you don't like that verb I can sum all the various occasions I've run and that sum almost certainly exceeded 42.2 km before I finished school).

But the difference I allude to here is more like how "book reviewer" is a different job than "book author": yes, if you can review a book, you can also write one. Eventually.

Have you worked with enterprise apps? The ones I have used for decades are hot garbage.
Now imagine decades of LLM code. Extrapolating the rate of increase of LoC, the source code ain't gonna fit on hard drives anymore.