Hacker News
by duxup 43 days ago
Yeah, I'd like to know in a solid way WHY Claude kept changing a file that I explicitly told it not to touch. The .md files and Claude's own plan all said not to touch that file, and Claude just kept at it. I've had it happen repeatedly lately. Really basic failures.

The idea being that, as frustrating as it is, if I knew why, I might be able to do something about it.

But no, we have the black box, where sometimes what comes out is just brain dead, and the rate at which you get bad output is a mystery...

It feels like gambling at times.

2 comments

I think it's simply a context thing: LLMs can go blind to any part of the instructions at any time, possibly when exploring complex micro tasks that create their own layers of context within them. That's how the pattern feels to me. It parallels the limit on the number of things a human can hold in their head at the same time. The more complex the thinking involved becomes, the bigger the self-generated context becomes, too. It doesn't seem like an easily fixed problem to me, other than by having an extremely small "mission critical instructions" context that is surfaced in a more impossible-to-ignore way.

I am by no means an expert, but I'd like to offer my mental model. It's up to you to decide whether it is solid or not, but it works for me.

I think the core intuition is that, like any other "rasterized" system with finite memory, which cannot encode the absence of anything (a relation, a concept, an entity), an LLM cannot encode the absence of something through its internal weights. Say you can have "Product" or "Order" tables in your database, but you cannot have "NotAProduct" or "NotAnOrder" tables, for the obvious reason that such relations would be infinite and uncountable. So, to establish the absence of a Product or an Order, your application must execute a "search" operation through the relevant tables.

But in LLM-space a "search" operation does not exist. It is mathematically undefined. An LLM arrives at its output (or "what to do") through a sequence of projections of the input token vectors through its "latent space". It "moves toward" high-probability clusters and is fundamentally unable to "move away". So the success of any "negation" in the prompt ("don't touch this file", "draw me a ballot box without a flag on it") depends on how heavily such a scenario is represented in the training data/model space. And again, the absence of something may be hard-to-impossible to usefully encode, especially if the "something" is not fixed.

Therefore, expecting a "don't touch this file" sentence to result in, well, not touching the file is pure gambling. Sometimes it may look like it is working, albeit for the wrong reasons, and at other times the LLM may do exactly the opposite, because its weight matrix statistically pushes it towards "touch this file", completely ignoring the "don't" (nonexistent in its latent space).
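To make the database half of the analogy concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `product`, the helper `is_product`, and the sample rows are all made up for illustration. It only shows that "not a product" is never stored anywhere: it is what you conclude when a lookup comes back empty. It says nothing about how an LLM works internally.

```python
import sqlite3


def is_product(conn, name):
    """Return True if `name` is found in the (hypothetical) product table."""
    row = conn.execute(
        "SELECT 1 FROM product WHERE name = ?", (name,)
    ).fetchone()
    # Absence is not a stored fact; it is the empty result of a search.
    return row is not None


# In-memory database with a single "positive" table; there is no NotAProduct table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO product (name) VALUES (?)", [("widget",), ("gadget",)]
)

print(is_product(conn, "widget"))   # True
print(is_product(conn, "unicorn"))  # False
```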

There is no way to reliably know what will work, and no "skill" or "art" in this. Well, no more than in dice rolling or horoscope casting.

I'd like to add that, for the above reason, I find the usefulness of "agentic development" on par with reading avian remains. But when I explored it, two pieces of practical advice seemed helpful in nudging the LLM around the negation problem:

- Omit the "don't" prompt completely, thus not creating a false "attractor" for the LLM; and

- Provide an alternate positive directive ("what to DO", not "what to NOT DO") to act as an "escape hatch" when the LLM might "want" to touch the sacred file or drop the production DB.
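
For what it's worth, the two points above could be sketched roughly like this. The helper `positive_directive`, the file paths, and the NOTES.md escape hatch are all hypothetical, and this only builds the instruction text; it does nothing to guarantee that a model will follow it.

```python
def positive_directive(scope, escape_hatch):
    """Phrase a constraint as what to DO: where to edit, and what to do otherwise.

    `scope` is a set of file paths the model is asked to work in (hypothetical),
    and `escape_hatch` is the alternative action offered for everything else.
    """
    allowed = ", ".join(sorted(scope))
    return (
        f"Make changes only in these files: {allowed}. "
        f"If a change seems to be needed anywhere else, {escape_hatch}"
    )


# The protected file is never named, so the prompt contains no "don't" attractor.
prompt = positive_directive(
    scope={"src/api.py", "src/models.py"},
    escape_hatch="stop and describe the proposed change in NOTES.md for review.",
)
print(prompt)
```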

While it seemed to work somewhat, I think it is trivially obvious that trying to predict all the nonsense an LLM might want to perform, and coming up with possible "escape hatches" for everything, very quickly becomes utterly impractical.