| > Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress. What's really fun about these tools is that this is both true and false! If you ask these tools to reason about things in-the-large you often get very useful information back out of them. I've asked about refactors I was thinking about doing, and gotten accurate and useful information back, which has been AMAZINGLY helpful for avoiding "let me start doing this, and then realize 4 hours in that it's not gonna work as well as I hoped" traps. But it's a very attention-on-one-thing-at-a-time thing. IMO this is fairly inherent to the models, but people have been doing great work making the tooling around them smarter in terms of how to break up tasks ahead of time to compensate, so I'm not gonna say it won't get materially better. So if you prompt it to do a task in a certain way, especially in a "plan mode" type of usage, you can get a pretty solid recipe + execution of a properly-designed implementation of that task. But if you're not opinionated and checking in frequently, you're gonna get the sorta median-approach or random-luck-output-of-the-day decision. And so the human-in-the-loop point is unlikely to go away as long as the human has more context, even if that context is just a half-baked or not-fully-realized intuition about how the code is likely to evolve in the future that you don't put into every prompt. > It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass. My hunch is that this is the same fundamental problem. 
When its attention is fully on "produce the next string of code" the parts of the context that relate to the broader system goals are NOT being considered as much for the output. So you get things like this, even with the latest Opus, when dealing with hard-to-isolate-in-a-single-test bugs (esp when it comes to multi-service call sequences or concurrent code): - "we need to fix this bug across eight methods in three files" - "I found the spot! we need to do [blah blah blah]" - "great, implement that plan" - "I've done it!" - "wait a sec... you moved some of the sequencing around, but didn't actually fix the fundamental issue" - "you're right! i moved [xyz] into [func b] instead of [func a] since it needed to be called later, but actually it needs to be after [func c] since it depends on the output of func b!" When asked about correctness it's good enough at "reasoning"-style output to spot these issues, but when generating code it's in such a pure "predict plausible code sequences" mode that this can get lost. |
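One way to keep that kind of sequencing fix honest is a test that pins the call order instead of only checking the final value. This is a rough sketch, not anyone's real code: `Pipeline`, `func_a`, `func_b`, `func_c`, `xyz` and `assert_called_after` are hypothetical names that mirror the placeholders in the exchange above.

```python
class Pipeline:
    """Hypothetical stand-in for the multi-step code path described above."""

    def __init__(self):
        self.calls = []  # records the order in which steps actually ran

    def func_a(self):
        self.calls.append("func_a")

    def func_b(self):
        self.calls.append("func_b")
        return 21

    def func_c(self, b_output):
        self.calls.append("func_c")
        return b_output * 2

    def xyz(self, c_output):
        self.calls.append("xyz")
        return c_output

    def run(self):
        self.func_a()
        b_output = self.func_b()
        c_output = self.func_c(b_output)
        # xyz consumes what func_c derived from func_b's output,
        # so it has to come last.
        return self.xyz(c_output)


def assert_called_after(calls, later, earlier):
    """Fail unless `later` appears in `calls` after `earlier`."""
    assert earlier in calls, f"{earlier} was never called"
    assert later in calls, f"{later} was never called"
    assert calls.index(later) > calls.index(earlier), (
        f"{later} ran before {earlier}: {calls}"
    )


if __name__ == "__main__":
    pipeline = Pipeline()
    pipeline.run()
    assert_called_after(pipeline.calls, later="xyz", earlier="func_c")
    print("call order ok:", pipeline.calls)
```

A test like this fails when `xyz` gets moved into the wrong function, even if the moved version still happens to return a plausible value.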
It's weird, working with LLMs. There are some things the LLMs are extremely good at doing autonomously. Like, reverse engineering, or reading documentation (and using that knowledge in other areas). There are things that they can do - but you need to explicitly prompt. Like, I've found Opus is quite good at optimising code. I've had a lot of success by asking it to write benchmarks (and do profiling), and use that data to improve the performance of some piece of code. That's often enough to get quite large performance improvements. You can get even further by showing it similar code others have written which is well optimised. It's very good at copying optimisation ideas from one project to another.
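For anyone who hasn't tried the benchmark-then-profile loop, here's a minimal sketch of the kind of harness you might ask for, assuming Python and only the standard library (`timeit`, `cProfile`, `pstats`). The two `sum_of_squares_*` functions are made-up stand-ins for "some piece of code" and a candidate optimisation; they aren't from any real project.

```python
import cProfile
import io
import pstats
import timeit


def sum_of_squares_slow(n):
    """Hypothetical baseline: builds a throwaway list before summing."""
    squares = []
    for i in range(n):
        squares.append(i * i)
    return sum(squares)


def sum_of_squares_fast(n):
    """Hypothetical candidate: closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def benchmark(func, n, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(timings) / number


def profile(func, n, top=5):
    """Return a cProfile report (as text) for a single call of func(n)."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(n)
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    n = 100_000
    # Check the results agree first, so "faster" can't just mean "wrong".
    assert sum_of_squares_slow(n) == sum_of_squares_fast(n)
    print(f"slow: {benchmark(sum_of_squares_slow, n):.6f}s per call")
    print(f"fast: {benchmark(sum_of_squares_fast, n):.6f}s per call")
    print(profile(sum_of_squares_slow, n))
```

The equality check before timing is the important bit: it gives the model (and you) a correctness gate, so the numbers only get compared between versions that produce the same answer.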
But then there are very simple things it really struggles to do. Some kinds of correctness testing. Invariants. System design.
Is it bad at that stuff, or do I just need to figure out how to prompt the LLM? To return to the topic of this thread, I think we're seeing a lot of different opinions on LLM generated code for 3 reasons:
1. Some people aren't looking at Claude's output at all. Some people are looking at the code and it looks fine to them. And some people (with more experience writing software) are looking at the code and judging it poorly.
2. We all prompt our LLMs very differently! It turns out that you get really different results based on how you prompt the machine. We're all figuring this out together. Some people have better instincts than others.
3. We're working on different projects. Claude is comparatively much better at end-user facing software. It's great at making a standalone website. It's much less good at finding and fixing obscure bugs in large, established pieces of software. If you work in consulting, LLMs can already do a lot of your job. If you work on Chrome or Unreal or the Windows kernel, maybe not so much.