by josephg
5 days ago
Yes, I’m in this camp. I’ve been pushing forward some personal projects lately using LLMs. At first, I was delighted at how productive I was, prompting. But over time a lot of cracks have started to show. Claude is good at programming “in the small”. It’s good at getting self contained, well scoped tasks done. But it’s bad at large scale system thinking. Over time, every project I’ve gotten Claude to write has become riddled with poor design choices layered on top of one another until even Claude struggles to make forward progress. And at that point, what do you do? You’ve gotta read all its code. Something I’ve learned I should just do from the start to save myself a lot of time later on. It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass. Generally, these models are amazing tools. But they cannot be trusted to make correct, maintainable software. At least not yet. Maybe in another year or two.
What's really fun about these tools is that this is both true and false!
If you ask these tools to reason about things in-the-large you often get very useful information back out of them.
I've asked about refactors I was thinking about doing, and gotten accurate and useful information back, which has been AMAZINGLY helpful for avoiding "let me start doing this, and then realize 4 hours in that it's not gonna work as well as I hoped" traps.
But it's a very attention-on-one-thing-at-a-time thing. IMO this is fairly inherent to the models, but people have been doing great work making the tooling around them smarter in terms of how to break up tasks ahead of time to compensate, so I'm not gonna say it won't get materially better.
So if you prompt it to do a task in a certain way, especially in a "plan mode" type of usage, you can get a pretty solid recipe + execution of a properly-designed implementation of that task.
But if you're not opinionated and checking in frequently, you're gonna get the sorta median-approach or random-luck-output-of-the-day decision. So the human-in-the-loop point is unlikely to go away as long as the human has more context, even if that context is just half-baked or not-fully-realized intuition about how the code is likely to evolve, the kind of thing you don't put into every prompt.
> It’s also strangely bad at correctness. You can ask it to write unit tests for a project. Unless you’re careful with your prompting, it will only write the unit tests that it knows will pass.
My hunch is that this is the same fundamental problem. When its attention is fully on "produce the next string of code", the parts of the context that relate to the broader system goals are NOT being considered as much for the output. So you get exchanges like this, even with the latest Opus, when dealing with hard-to-isolate-in-a-single-test bugs (esp when it comes to multi-service call sequences or concurrent code):
- "we need to fix this bug across eight methods in three files"
- "I found the spot! we need to do [blah blah blah]"
- "great, implement that plan"
- "I've done it!"
- "wait a sec... you moved some of the sequencing around, but didn't actually fix the fundamental issue"
- "you're right! i moved [xyz] into [func b] instead of [func a] since it needed to be called later, but actually it needs to be after [func c] since it depends on the output of func b!"
When asked about correctness it's good enough at "reasoning"-style output to spot these issues, but when generating code it's in such a pure "predict plausible code sequences" mode that this can get lost.
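To make the "hard-to-isolate-in-a-single-test" point concrete, here's a toy sketch of the kind of sequencing bug I mean. Everything in it is made up for illustration (the function names, the order/discount domain, the threshold), and it isn't taken from any real project. Each function is fine on its own, so per-function tests pass for both versions; only a test that runs the whole sequence tells them apart.

```python
# Hypothetical three-step pipeline; all names and numbers are invented for illustration.

def load_order(prices):
    """'func a': build the order record from raw prices."""
    return {"items": list(prices), "total": None, "discount": 0}

def compute_total(order):
    """'func b': sum the item prices."""
    order["total"] = sum(order["items"])
    return order

def apply_discount(order):
    """'xyz': depends on the total produced by compute_total."""
    if order["total"] is not None and order["total"] >= 100:
        order["discount"] = 10
    return order

def process_wrong(prices):
    # Sequencing bug: the discount step runs before the total exists,
    # so it silently does nothing.
    return compute_total(apply_discount(load_order(prices)))

def process_right(prices):
    # The discount step runs after the total it depends on.
    return apply_discount(compute_total(load_order(prices)))

# Per-function tests like these pass no matter which pipeline is in use.
assert compute_total(load_order([60, 50]))["total"] == 110
assert apply_discount({"items": [], "total": 150, "discount": 0})["discount"] == 10

# Only a test of the full call sequence exposes the difference.
assert process_wrong([60, 50])["discount"] == 0
assert process_right([60, 50])["discount"] == 10
```

The tests the model "knows will pass" tend to look like the first two assertions. The last two are the ones you usually have to ask for explicitly.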