|
As human developers, I think we're struggling with "letting go" of the code. The code we write (or agents write) is really just an intermediate representation (IR) of the solution. For instance, GCC will inline functions, unroll loops, and myriad other optimizations that we don't care about (and actually want!). But when we review the ASM that GCC generates we are not concerned with the "spaghetti" and the "high coupling" and "low cohesion". We care that it works, and is correct for what it is supposed to do. Source code in a higher-level language is not really different anymore. Agents write the code, maybe we guide them on patterns and correct them when they are obviously wrong, but the code is just the work-item artifact that comes out of extensive specification, discussion, proposal review, and more review of the reviews. A well-guided, iterative process and problem/solution description should be able to generate an equivalent implementation whether a human is writing the code or an agent. |
Translating a natural prompt on the other hand requires the LLM to make thousands of small decisions that will be different each time you regenerate the artifact. Even ignoring non-determinism, prompt instability means that any small change to the spec will result in a vastly different program.
A natural language spec and test suite cannot be complete enough to encode all of these differences without being at least as complex as the code.
Therefore each time you regenerate large sections of code without review, you will see scores of observable behavior differences that will surface to the user as churn, jank, and broken workflows.
Your tests will not encode every user workflow, not even close. Ask yourself if you have ever worked on a non trivial piece of software where you could randomly regenerate 10% of the implementation while keeping to the spec without seeing a flurry of bug reports.
This may change if LLMs improve such that they are able to reason about code changes to the degree a human can. As of today they cannot do this and require tests and human code review to prevent them from spinning out. But I suspect at that point they’ll be doing our job, as well as the CEOs and we’ll have bigger problems.