|
|
|
|
|
by FartyMcFarter
52 days ago
|
|
Software engineering is not a new field. Best practices on testing are mature now, and Anthropic has poached enough engineers from companies with a solid understanding of those practices. Yet, their flagship product got three really bad changes shipped into it and only resolved after more than a month. This raises another question: with all the industry-wide boasting about AI-driven productivity, why does the leading company in agentic coding take over a month to fix severe customer-reported issues? |
|
My unfounded suspicion: because this is the tradeoff we're all facing and for the most part refusing to accept when transitioning over to LLM-driven coding. This is exactly how we're being trained to work by the strengths and limitations of this new technology.
We used to depend on maintaining a global if incomplete understanding of a whole system. That enabled us to know at a glance whether specs and tests and actual behavior made sense and guided our thinking, enabling us to know what to look at. With agentic coding, the brutal truth is that this is now a much less "efficient" approach and we'll ship more features per day by letting that go and relying on external signs of behavior like test suites and an agent's analysis with respect to a spec. It enables accomplishing lots of things we wouldn't have done before, often simply because it would be too much friction to integrate it properly -- write tests, check performance, adjust the conceptual understanding to minimize added complexity, whatever.
So in order to be effective with these new tools, we're naturally trained to let go of many of the things we formerly depended on to keep quality up. Mistakes that would have formerly been evidence of stupidity or laziness are now the price to pay for accelerate productivity, and they're traded off against the "mistakes" that we formerly made that were less visible, often because they were in the form of opportunity cost.
Simple example: say you're writing a simple CLI in Python. Formerly, you might take in a fixed sequence of positional arguments, or even if you did use argparse, you might not bother writing help strings for each one. Now because it's no harder, the command-line processing will be complete and flexible and the full `--help` message will cover everything. Instead, you might have a `--cache-dir=DIR` option that doesn't actually do anything because you didn't write a test for it and there's no visible behavioral change other than worse performance.
Closely related, what do you do with user feedback and complaints? Formerly they might be one of your main signals. Now you've found that you need dependable, deterministic results in your test suite that the agent is executing or it doesn't help. User input is very very noisy. We're being trained away from that. There'll probably be a startup tomorrow that digests user input and boils out the noise to provide a robust enough signal to guide some monitoring agent, and it'll help some cases, and train us to be even worse at others.