Hacker News
by guhcampos 31 days ago
I'm a convert. I was 100% skeptical about LLM code generation, now over 80% of the professional code I write is generated.

That said, the limitations are kind of obvious and are starting to show in some of my projects, and this article seems to confirm my suspicions. If it's just confirmation bias or not, I can't say yet.

In my experience, for anything complex enough, I have to start adding more and more constraints, style guides, corner cases, error handling, optimization guidelines and all this good stuff to my Markdown specifications, rules and skills. At some point this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language. The writing speed gains are enormous, yeah, and business sees this as productivity gains, of course - and we do it because the pressure for increased productivity is there, as it's always been; yet the trade off seems to be clear and a lot of people are just ignoring it.

7 comments

> moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language

It's like using a compiler that generates semantically different code every time you run it. Basically like compiling a program that's full of UB but "seems to work" most of the time.

> business sees this as productivity gains

Back to LoC/s as a measure of "productivity."

> Back to LoC/s as a measure of "productivity."

IMO this doesn’t follow from what OP wrote. I personally measure it with a more abstract “how long does it take me to ship something that is useful in production and solving a real problem” and the increase in speed there has been massive for me. But of course I’m not a bigbrain 10x coder that is doing bleeding edge novel stuff like most people here, so gains might be more obvious for me than for others.

> how long does it take me to ship something that is useful in production and solving a real problem

But that’s only half of the problem. What about “and how easy is it to maintain long-term”? If you say that maintenance can be done via LLM, I would argue that there are zero guarantees that LLMs are backwards compatible and that the markdown you wrote now will work just as fine in 1, 2, 3 years.

> I would argue that there are zero guarantees that LLMs are backwards compatible and that the markdown you wrote now will work just as fine in 1, 2, 3 years

That this would be the case is even more guaranteed than some programming language being backwards compatible and the code we wrote working just as fine in 1, 2, 3 years.

Languages do get non-backwards compatible changes, dependencies break, stuff is deprecated, etc.

But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026.

"Languages do get non-backwards compatible changes, dependencies break, stuff is deprecated, etc."

Sure, but they're deterministic and sometimes you can even do automatic rewrites through AST inspection and writing back to the files instead of scripting string substitutions on them directly.
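
To make the AST point concrete, here is a minimal sketch in Python using the standard `ast` module. The names `old_api` and `new_api` are made up for illustration; nothing here comes from a real migration. It renames calls by inspecting the syntax tree, and contrasts that with a plain string substitution, which also rewrites text inside a string literal.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename calls to a bare function name, leaving string literals alone."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested calls such as f(f(1))
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        return node


def rename_call(source: str, old: str, new: str) -> str:
    """Parse, rewrite, and regenerate source text (needs Python 3.9+)."""
    tree = RenameCall(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


# Hypothetical input: one string that merely mentions the old name,
# and one line that actually calls it.
SOURCE = 'msg = "call old_api() here"\nresult = old_api(old_api(1))\n'

rewritten = rename_call(SOURCE, "old_api", "new_api")
naive = SOURCE.replace("old_api", "new_api")  # also edits the string literal

print(rewritten)
print(naive)
```

One caveat with this sketch: `ast.unparse` discards comments and original formatting, so real codemods tend to use tools that preserve them.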

"But the job of LLMs will remain to generate something from a prompt, and the markdown we wrote, as it's high level and not tied to language versions, APIs, and implementation details, will be just as good a prompt for that in 2050 as it is in 2026."

Your organisation is keeping version control on the LLMs you use? It's all local, old copies of these databases are kept in secure storage together with the querying and harnessing software?

> At some point this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language.

This is the problem nobody is talking about. I see codebases growing in MD files with instructions and guidelines and requests that are also LLM generated… and it’s all piling up. No one is reviewing it 100%, and even when we do, it’s all very subjective. What’s the difference between “Follow a RESTful approach”, “We use REST, not GraphQL”, “90% of our endpoints are resource oriented, but we have a couple of endpoints that look RPC-ish; please ignore the latter”… It’s all very stupid.

This is why you need to be generating more linter rules instead of just having things be in markdown files.

I had never written an eslint rule until i started having agents pump them out for me and now I've encoded a bunch of important rules as lint rules that will fail CI if violated.
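
For readers who have not written a custom lint rule, the idea can be sketched in a few lines. This is a Python analogue using the standard `ast` module, not an eslint rule, and the banned-call list is an assumption chosen purely for illustration. The point is that the rule yields specific messages and a nonzero exit status that a CI step can fail on.

```python
import ast

# Illustrative rule set; a real project would encode its own conventions.
BANNED_CALLS = {"eval", "exec"}


def find_violations(source: str, filename: str = "<string>") -> list[str]:
    """Return one message per call to a banned built-in function."""
    violations = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in BANNED_CALLS
        ):
            violations.append(
                f"{filename}:{node.lineno}: call to banned function {node.func.id}()"
            )
    return violations


def main(paths: list[str]) -> int:
    """Lint each file and return the exit status for the CI step."""
    failures = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            failures.extend(find_violations(handle.read(), path))
    for message in failures:
        print(message)
    return 1 if failures else 0  # nonzero status is what fails the build
```

A CI job would then run something like `sys.exit(main(sys.argv[1:]))` over the changed files.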

Who lints the linters

Linter linters, obviously

It's linters all the way down

A linter won't prevent your idiot LLM from going bonkers and suddenly switching to GQL instead of REST just for that one endpoint because it confabulated something, or from putting your Stripe secret into your React frontend. These are all cases of slop I've seen happen.

That's why we still do code review. The linter rules are just about lowering the number of mistakes you have to catch at code review time.

> The linter rules are just about lowering the number of mistakes you have to catch at code review time.

Aren't they, in the modern context, mostly used for code formatting and such? I don't recall anyone using them today for "catching errors". Unless you count code formatting style violations as 'errors'.

Maybe in whatever language ecosystem you are in, but in the JavaScript world most projects have tons of eslint rules that are specifically designed to stop bugs.

Like for instance there are tons of eslint rules to make sure you aren't breaking the rules of react, like having missing dependencies in a useEffect dependencies array, or calling a react hook conditionally.

Software loose on theory[1] trying to compensate with moar md.

[1] https://pages.cs.wisc.edu/~remzi/Naur.pdf

One question I have is are these "constraints, style guides, corner cases, error handling, optimization guidelines" extra things that you wouldn't need otherwise, or are they formal documentation of the baked in assumptions and knowledge accumulated over the years? Every project I've ever worked on has had heaps of shared knowledge that's just part of stuff the team just "knows" and no one ever really writes down. Things like "sure you can use java's built in assert for tests, but we don't compile or run the application with the flags that enable them. Use junit's assertions/use the assertj library." or "prefer using auto generated accessors instead of manually writing them out". Even things like "if you change the structure of this ID string, you need to change all the code in modules A, B, and C because they all rely on the ID being in a certain format".

If you're really lucky, maybe a lot of this is documented in some wiki page somewhere, but everyone knows the documentation is never as complete as you'd like it to be. The longer a team works together without new people coming on board, the more likely it is that the documentation of these soft requirements and knowledge has drifted from reality. IME nothing shows how much you've failed to document better than revisiting your onboarding process documents for the first time 2-3 years after you wrote them.

As I've experimented with the various AI tools, I feel like a lot of these extra documents I've written are documenting a lot of these things "everyone knows". But I'm also not at the "80% of the professional code I write is generated" stage yet. So I'm curious if you're finding that you're creating documentation that goes beyond just documenting what we used to just keep in our heads and are now getting into "writing a book about how to code" territory?

There is a great thing. Because the agents can do so much toil you can add things like formal verification, fuzzing, and other feedback mechanisms and quality gates to your projects cheaply. In a human written project you still needed those things, but it cost a lot. Agents require these quality gates and they can implement them for you. The problem with AI documentation is it will just write a lot of useless bullshit unless you guide it on what is important. You can also get agents to identify transitive dependencies via testing and other things.

I adopt the mindset of docs are for humans, tests are for agents. They document formal dependencies and leave a measurable artifact behind. If you identify some behavior or transitive dep in your system, agents document it first with a test codifying the expected behavior. Tests are the source of truth about expected system behavior and you can convince agents to write decent behavioral tests if you ask them to with the right structure. Docs are now cheap and a render, not a long term thing. There is some token efficiency to consider, but still, they are quick and cheap if you don't understand some module or its purpose.
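
As a sketch of what "document it first with a test" might look like, take the ID-format dependency mentioned upthread. Everything below is hypothetical: the `<region>-<8 digits>` layout, `make_id`, and `parse_region` are stand-ins for a producer and a downstream consumer. The tests pin the shared assumption so that changing the layout in one place fails loudly.

```python
import re
import unittest

# Hypothetical ID layout used for illustration: "<region>-<8 digit number>".
ID_PATTERN = re.compile(r"^(?P<region>[a-z]{2})-(?P<number>\d{8})$")


def make_id(region: str, number: int) -> str:
    """Stand-in for the module that produces IDs."""
    return f"{region}-{number:08d}"


def parse_region(identifier: str) -> str:
    """Stand-in for a downstream module that depends on the ID layout."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unexpected ID format: {identifier!r}")
    return match["region"]


class IdFormatContract(unittest.TestCase):
    def test_generated_ids_match_the_documented_layout(self):
        self.assertRegex(make_id("eu", 42), ID_PATTERN)

    def test_downstream_parser_accepts_generated_ids(self):
        self.assertEqual(parse_region(make_id("us", 7)), "us")

    def test_parser_rejects_a_changed_layout(self):
        with self.assertRaises(ValueError):
            parse_region("us_00000007")
```

Running this under `python -m unittest` gives the measurable artifact: if someone changes `make_id` without touching the consumers, the second test names the broken dependency.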

Yeah "plus one" to this. Static analysis, fuzzing, linting, integration tests -- there are all sorts of very useful artifacts which have been around for a long time, but which are very time consuming to implement and then maintain. LLMs shift the economics around producing and maintaining these tremendously, so we can now afford these robust validation mechanisms.

These serve as living documentation which cries out in pain when they get out of sync with the system in question, generating specific error messages -- as opposed to natural language docs which rapidly drift into an ambiguous "kinda useful" state. And the validation is performed mechanically (as opposed to neurally) so no hallucinations are possible.

The one thing I would add is that you do want these artifacts to be human-friendly from a reading perspective -- you want engineers to be able to scan over these and check that they are validating the right things.

> Because the agents can do so much toil you can add things like formal verification, fuzzing, and other feedback mechanisms and quality gates to your projects cheaply

Works great until they sweep a test under the rug that always passes because the condition is something like if(true).

That was my point. Validating actual behavioral tests. Not letting them cheat. They still will at times, but like, read their code, fix it or send a reviewer agent to find and make a todo list. If you give them a behavioral test skill they will do a much better job. Sometimes I have to hint to them. I rarely ship anything I have not reviewed at least once.

> Not letting them cheat. They still will at times, but like, read their code,

Well then, if they "still will", your effort kind of misses the point. Sure maybe, you'll catch it every time and maybe that one time you did not catch it, it was no critical mistake...But it only needs to make that critical mistake once, and all of this effort was in vain.

(as an outsider) what this sounds a lot like to me is trying to manage a very large team of human personnel that have a high turnover rate which is not directly in your control.

Some of them will make mistakes, some of them will cheat, some of them will do things you don't like, and "punishing" them will be less helpful to you due to the high turnover than building a system which instead disincentivizes things from a high level. Which catches bad actions and starts them over.

Classically I think we are more accustomed to "building a team of humans, and being able to chastise or fire a bad employee helps the team grow more cohesive and build accountability".

But it is possible to get the same (less than ideal) situation with teams of humans where accountability cannot be easily instilled into the team as we have with teams of agents.

And then obviously the reason one might consider using such an unusual and difficult to manage team as a tool is when the cost is low and the supply is high, which is purportedly the case with AI at least for the moment.

I'm not having much trouble with very large (>50mb raw source) and complex codebases. The fact that it's all strongly typed probably helps a lot, but I don't think that's the whole story.

I think the harness and code patching technique starts to matter a lot more once you get outside the trivial range of codebases that fit within the first ~20% of the context window and can otherwise be iterated completely in a single inference pass.

The apply_patch technique that OAI has polished their models on seems to be the best approach for monster scale codebases. Anything based on line ranges and simple find-replace will disintegrate at the edges. You need multiple spatial anchors to deal with nasty things like cshtml files. The prepare/commit behavior is ideal for iterating through ambiguous contexts across many large files and refining anchors.

hehe, this is by design. next model needs to eat your natural language

> I have to start adding more and more constraints, style guides, corner cases, error handling, optimization guidelines and all this good stuff to my Markdown specifications, rules and skills

So kind of like maintaining a growing codebase? But this time around you cannot guarantee what the outputs will be?

> yet the trade off seems to be clear and a lot of people are just ignoring it.

There's plenty of focus on the negative side of the tradeoff. Less so on why we're making it anyway, or why it somehow works out even if "this starts to look like we're all just moving complexity from the more formal and deterministic world of programming languages to the informal and non-deterministic world of natural language".

And the answer to that can be condensed to a one-liner, which I quote after[0]:

  sizeof(docs) << sizeof(code)
--

[0] - https://drensin.medium.com/elephants-goldfish-and-the-new-go... - article may be a bit fluffy here and there, but that one line was a big insight for me.

I don’t think that’s true based on experience. Maybe “<” instead of “<<”, yeah. But even in that case, it’s an awful trade off for any serious codebase that needs to be maintained over the years (and you don’t know what LLMs are gonna look like next year, so there are zero guarantees all your MD is gonna work as good as it’s “working” right now)

As long as LLMs remain at the same skill level at coding, or better, there's a 100% guarantee an MD (a glorified prompt) is gonna work as good as it’s “working” right now.

This is quite a claim without any evidence to substantiate it. LLMs are nondeterministic models whose behaviour is reliant on training data, model architecture and context (both in the general and domain specific sense).

There is absolutely no guarantee llm1(MD) == llm2(MD), by design. With the current batch you need to explicitly constrain a number of parameters, far more than simply the prompt, to get identical output from the _same_ model, let alone another model that has varied training data and/or architecture.
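
A toy illustration of why the prompt alone does not pin down the output. This is not a real model: `LOGITS` is a made-up table of next-token scores, and the sampler only mimics temperature sampling. Even so, it shows that the same "prompt" yields repeatable output only when the seed and temperature are fixed too. Real deployments may have further sources of variation that this sketch does not model.

```python
import math
import random

# Toy next-token scores; a stand-in for a real model's logits.
LOGITS = {"REST": 2.0, "GraphQL": 1.5, "RPC": 0.5}


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one token; temperature 0 means greedy (argmax) decoding."""
    if temperature == 0:
        return max(logits, key=logits.get)
    weights = [math.exp(score / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


def generate(seed: int, temperature: float, length: int = 8) -> list[str]:
    """Sample a short sequence from the same fixed scores every time."""
    rng = random.Random(seed)
    return [sample_token(LOGITS, temperature, rng) for _ in range(length)]


# Same seed and temperature: identical output on every call.
print(generate(seed=1, temperature=1.0) == generate(seed=1, temperature=1.0))
# Greedy decoding ignores the seed entirely.
print(generate(seed=1, temperature=0) == generate(seed=2, temperature=0))
# Across many seeds, sampled outputs vary even though the scores never change.
print(len({tuple(generate(seed=s, temperature=1.0)) for s in range(50)}))
```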

Models are not innately backwards-compatible. Both OpenAI and Anthropic encourage running evaluations and comparing the performance of your existing agent workflows against new models before just stepping up to the newest one because you may encounter regressions. I myself have seen lengthy/long-horizon multi-agent workflows begin breaking after moving to a newer model because for some reason the prompt containing an instruction to call a tool that worked 99/100 times before suddenly just stops working and needs to be modified.
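
The "run evaluations before stepping up" advice can be sketched as a small regression check. This is an assumption-laden outline, not any vendor's eval API: `Model` is just a callable from prompt to text, the two stub models and the single case are invented, and the 2% tolerance is arbitrary.

```python
from typing import Callable

Model = Callable[[str], str]   # hypothetical: prompt in, text out
Check = Callable[[str], bool]  # hypothetical: output in, pass/fail out
Case = tuple[str, Check]


def pass_rate(model: Model, cases: list[Case], runs: int = 5) -> float:
    """Fraction of (case, run) pairs whose output satisfies the case's check."""
    passed = sum(check(model(prompt)) for prompt, check in cases for _ in range(runs))
    return passed / (len(cases) * runs)


def regressed(old: Model, new: Model, cases: list[Case], tolerance: float = 0.02) -> bool:
    """True when the new model's pass rate drops by more than the tolerance."""
    return pass_rate(new, cases) < pass_rate(old, cases) - tolerance


# Stub models standing in for two model versions given the same prompt.
def old_model(prompt: str) -> str:
    return '{"tool": "search", "query": "x"}'


def new_model(prompt: str) -> str:
    return "Sure! I would call the search tool."


def emits_tool_call(output: str) -> bool:
    """Check that the output looks like the structured tool call we expect."""
    return output.strip().startswith("{") and '"tool"' in output


CASES: list[Case] = [("Call the search tool for x", emits_tool_call)]

print(regressed(old_model, new_model, CASES))
```

Repeating each case several times matters because a single passing run says little about an instruction that only works most of the time.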