| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 154 days ago

I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code:

> But in context, this was obviously insane. I knew that key and id came from the same upstream source. So the correct solution was to have the upstream source also pass id to the code that had key, to let it do a fast lookup.

I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

My guess is that Claude is trained to bias towards making minimal edits to solve problems. This is a desirable property, because six months ago a common complaint about LLMs is that you'd ask for a small change and they would rewrite dozens of additional lines of code.

I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

5 comments

bblcla 154 days ago

(Author here)

> I'm not entirely convinced by the anecdote here where Claude wrote "bad" React code

Yeah, that's fair - a friend of mine also called this out on Twitter (https://x.com/konstiwohlwend/status/2010799158261936281) and I went into more technical detail about the specific problem there.

> I've seen Claude make mistakes like that too, but then the moment you say "you can modify the calling code as well" or even ask "any way we could do this better?" it suggests the optimal solution.

I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

> I expect that adding a CLAUDE.md rule saying "always look for more efficient implementations that might involve larger changes and propose those to the user for their confirmation if appropriate" might solve the author's complaint here.

I'm not sure about this! There are a few things Claude does that seem unfixable even by updating CLAUDE.md.

Some other footguns I keep seeing in Python and constantly have to fix despite CLAUDE.md instructions are:

- writing lots of nested if clauses instead of writing simple functions by returning early

- putting imports in functions instead of at the top-level

- swallowing exceptions instead of raising (constantly a huge problem)

These are small, but I think it's informative of what the models can do that even Opus 4.5 still fails at these simple tasks.

link

ako 154 days ago

> I agree, but I think I'm less optimistic than you that Claude will be able to catch its own mistakes in the future. On the other hand, I can definitely see how a ~more intelligent model might be able to catch mistakes on a larger and larger scale.

Claude already does this. Yesterday i asked it why some functionality was slow, it did some research, and then came back with all the right performance numbers, how often certain code was called, and opportunities to cache results to speed up execution. It refactored the code, ran performance tests, and reported the performance improvements.

link

ekidd 154 days ago

I have been reading through this thread, and my first reaction to many of the comments was "Skill issue."

Yes, it can build things that have never existed before. Yes, it can review its own code. Yes, it can do X, Y and Z.

Does it do all these things spontaneously with no structure? No, it doesn't. Are there tricks to getting it do some of these things? Yup. If you want code review, start by writing a code review "skill". Have that skill ask Opus to fork off several subagents to review different aspects, and then synthesize the reports, with issues broken down by Critical, Major and Minor. Have the skill describe all the things you want from a review.

There are, as the OP pointed out, a lot of reasons why you can't run it with no human at all. But with an experienced human nudging it? It can do a lot.

link

ako 154 days ago

It's basically not very different from working with an average development team as a product owner/manager: you need to feed it specific requirements or it will hallucinate some requirements, bugs are expected, even with unit test and testers on the team. And yes, as a product owner you also make mistakes, never have all the requirements up front, but the nice thing working with a GenAI coder is that you can iterate over these requirement gaps, hallucinated requirements and bugs in minutes, not in days.

link

chapel 154 days ago

Those Python issues are things I had to deal with earlier last year with Claude Sonnet 3.7, 4.0, and to a lesser extent Opus 4.0 when it was available in Claude Code.

In the Python projects I've been using Opus 4.5 with, it hasn't been showing those issues as often, but then again the projects are throwaway and I cared more about the output than the code itself.

The nice thing about these agentic tools is that if you setup feedback loops for them, they tend to fix issues that are brought up. So much of what you bring up can be caught by linting.

The biggest unlock for me with these tools is not letting the context get bloated, not using compaction, and focusing on small chunks of work and clearing the context before working on something else.

link

bblcla 154 days ago

Arguably linting is a kind of abstraction block!

link

pluralmonad 154 days ago

I wonder if this is specific to Python. I've had no trouble like that with Claude generating Elixir. Claude sticks to the existing styles and paradigms quite well. Can see in the thinking traces that Claude takes this into consideration.

link

doug_durham 154 days ago

That's where you come in as an experienced developer. You point out the issues and iterate. That's the normal flow of working with these tools.

link

bblcla 154 days ago

I agree! Like I said at the end of the tool, I think Claude is a great tool. In this piece, I'm arguing against the 'AGI' believers who think it's going to replace all developers.

link

Kuinox 154 days ago

> My guess is that Claude is trained to bias towards making minimal edits to solve problems.

I don't have the same feeling. I find that claude tends to produce wayyyyy too much code to solve a problem, compared to other LLMs.

link

ehnto 153 days ago

That has been my impression too, it takes particular guidance to get it to write concise code without too much architecture airmanship.

link

joshribakoff 154 days ago

I expect that adding instructions that attempt to undo training produces worse results than not including the overbroad generalization in the training in the first place. I think the author isn’t making a complaint they’re documenting a tradeoff.

link

AIorNot 154 days ago

Well yes but the wider point is that it takes new Human skills to manage them - like a pair of horses so to speak under your bridle

When it comes down to it these AI tools are like going to power tools or machines from the artisanal era

- like going from surgical knife to a machine gun- so they operate at a faster pace without comprehending like humans - and without allowing humans time to comprehend all side effects and massive assumptions they make on every run in their context window

humans have to adapt to managing them correctly and at the right scale to be effective and that becomes something you learn

link

threethirtytwo 154 days ago

Definitely, The training parameters encourage this. The AI is actually deliberately also trying to trick you and we know for that for a fact.

Problems with solutions too complicated to explain or to output in one sitting are out of the question. The AI will still bias towards one shot solutions if given one of these problems because all the training is biased towards a short solution.

It's not really practical to give it training data with multi step ultra complicated solutions. Think about it. The thousands of questions given to it for reinforcement.... the trainer is going to be trying to knock those out as efficiently as possible so they have to be readable problems with shorter readable solutions. So we know AI biases towards shorter readable solutions.

Second, Any solution that tricks the reader will pass training. There is for sure a subset of questions/solution pairs that meet this criteria by definition because WE as trainers simply are unaware we are being tricked. So this data leaks into the training and as a result AI will bias towards deception as well.

So all in all it is trained to trick you and give you the best solution that can fit into a context that is readable in one sitting.

In theory we can get it to do what we want only if we had perfect reinforcement data. The reliability we're looking for seems to be just right over this hump.

link