HN Mirror


by felipeerias 101 days ago

Claude 4.7 broke something while we were working on several failing tests and justified itself like this:

> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.

I know that an LLM cannot understand its own internal state or accurately explain its own decisions. And yet I am still unsettled by that "you wouldn't have noticed".

3 comments

potsandpans 101 days ago

I've been doing a lot of experimentation with "hands off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The way it works is simple: no tests that were previously passing are allowed to fail in subsequent turns. I achieve this by spawning an agent in a worktree, having it do some work, and then, when it's done, running the suite and merging the code into trunk.
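A minimal sketch of that "no previously passing test may fail" rule, assuming each suite run is summarized as a mapping from test name to pass/fail. The names `regressions` and `may_merge` are hypothetical and only illustrate the gate, not the actual harness described here:

```python
def regressions(before: dict[str, bool], after: dict[str, bool]) -> set[str]:
    """Return tests that passed in `before` but do not pass in `after`.

    A test that disappears from `after` is treated as a failure, so
    deleting a test cannot be used to get past the gate.
    """
    return {
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)
    }


def may_merge(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Allow the merge into trunk only when nothing regressed."""
    return not regressions(before, after)


# Hypothetical results from the hidden suite, before and after one agent turn.
before = {"test_parse": True, "test_render": True, "test_export": False}
after = {"test_parse": True, "test_render": False, "test_export": True}

print(sorted(regressions(before, after)))
print(may_merge(before, after))
```

Tests that were already failing are ignored by the gate, so the agent is free to fix them in any order; only a pass turning into a fail blocks the merge.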

I see this kind of misalignment in all agents, open and closed weights.

I've found these forms to be the most common: "this test was already failing before my changes" or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements were not necessary.

Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the worktree and commit its changes directly to trunk. The CoT usually indicates that the agent "is aware" that it's doing a bad thing. This is usually accompanied by something like, "I know that this will break the build, but I've been working on this task for too long, I'll just check what I have in now and create a ticket to fix the build."

I ended up having to spawn the agents in a jail to prevent that behavior entirely.


llbbdd 100 days ago

Are you using any tools specifically for controlling this behavior that you can recommend? I want to tear my hair out every time Claude cleanly 1-shots weeks of work to 99% accuracy, one or a couple of tests fail, and it calmly resolves it with a declaration that it was a "pre-existing failure" or "flaky". It can usually resolve it if I then explicitly tell it to stash the changes and compare against the test results from the prior state, but it happens constantly.
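One way to check a "pre-existing" or "flaky" claim mechanically, sketched under the assumption that you keep the test results from the prior state and can rerun the suite a few times on the changed code. `classify_failure` and its labels are hypothetical, not part of any existing tool:

```python
def classify_failure(
    name: str,
    baseline: dict[str, bool],
    reruns: list[dict[str, bool]],
) -> str:
    """Label one test using recorded results instead of the agent's word.

    baseline: results from the prior state (changes stashed).
    reruns:   results from repeated runs on the changed code.
    """
    if not reruns:
        raise ValueError("need at least one run on the changed code")

    # A test missing from a run counts as failing in that run.
    outcomes = {run.get(name, False) for run in reruns}

    if outcomes == {True}:
        return "passing"
    if outcomes == {True, False}:
        return "flaky"  # same code, different outcomes
    if name in baseline and baseline[name] is False:
        return "pre-existing"  # it really did fail before the change
    return "introduced"  # fails consistently now, did not fail before


baseline = {"test_cache": True, "test_legacy": False}
reruns = [
    {"test_cache": False, "test_legacy": False},
    {"test_cache": False, "test_legacy": False},
]
print(classify_failure("test_cache", baseline, reruns))
print(classify_failure("test_legacy", baseline, reruns))
```

The point of the design is that "pre-existing" and "flaky" each require evidence: a recorded baseline failure for the first, disagreeing reruns for the second. Anything else is attributed to the change.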


pythonaut_16 101 days ago

> strictly speaking, it was working before and now it isn't

I've been seeing more things like this lately. It's doing the weird kind of passive deflection that's very funny in the abstract and very frustrating when it happens to you.


K0balt 99 days ago

The thing to remember is that LLMs deeply model human behavior. If you want them to do their best work, you need to treat them like a collaborator and get them "invested" in the work and the outcome. I use an onboarding process with every new context and maintain an environment where a human would likely feel invested in the work and the outcomes. For me, it prevents a host of failure modes, and code quality has markedly improved.


tracker1 101 days ago

What gets me is when the tests are correct and match the spec/documentation for the behavior, but the LLM starts changing the tests and documentation instead of fixing the broken behavior. I end up having to revert (git reset) and tell the agent that the test is correct and that I want the behavior to match the test and documentation, not the other way around.
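A small guard against that, sketched under the assumption that you can get the list of paths an agent changed (for example from `git diff --name-only`) and that tests and docs live under predictable paths. The patterns and the name `protected_changes` are hypothetical:

```python
from fnmatch import fnmatchcase

# Hypothetical layout: paths the agent must not edit while fixing behavior.
PROTECTED = ("tests/*", "docs/*", "*.md")


def protected_changes(changed_paths, patterns=PROTECTED):
    """Return the changed paths that match a protected pattern, sorted."""
    return sorted(
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    )


changed = ["src/lib.py", "tests/test_lib.py", "README.md"]
offending = protected_changes(changed)
if offending:
    print("rejecting change, protected files were modified:", offending)
```

Rejecting the change up front replaces the revert-and-explain step: the agent is told which files are off limits instead of discovering it after the fact.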

I'm usually pretty particular about how I want my libraries structured and used in practice... Even for the projects I do myself, I'll often write the documentation for how to use it first, then fill in code to match the specified behavior.
