|
My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is tool weaknesses in terms of managing compaction to avoid running out of context at inopportune moments. It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel. They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tends to focus on three separate areas: 1) debugging guidance for a given project (for my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes" (it sometimes did that without being told, but not consistently; with the instructions it usually does tht), 2) acceptance criteria - this does need reiteration, 3) telling it to run tests frequently, make small, testable changes, and to frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work. My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks. It doesn't always provide perfect code every step. But neither do I. It does however usually move in the right direction every step, and has consistently produced progress over time with far less effort on my behalf. I wouldn't have it start from scratch without at least some scaffolding that is architecturally sound yet, but it can often do that too, though that needs review before it "locks in" a bad choice. I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours. |
But that is exactly the problem, no?
It is like, when you need some prediction (e.g. about market behavior), knowing that somewhere out there there is a person who will make the perfect one. However, instead of your problem being to make the prediction, now it is how to find and identify that expert. Is that type of problem that you converted yours into any less hard though?
I too had some great minor successes, the current products are definitely a great step forward. However, every time I start anything more complex I never know in advance if I end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it definitely fixed the problem), or something usable.
All those examples such as yours suffer from one big problem: They are selected afterwards.
To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.
Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.
PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are beyond that, willing to give it a try, and would like to have a link to the working and reproducible example. I understand that work can rarely be shared, but then those examples are not very useful any more at this point. What would add real value for readers of these discussions now is when people who say they were successful posted the full, working, reproducible example.
EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.
We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.