|
|
|
|
|
by felipeerias
54 days ago
|
|
Claude 4.7 broke something while we were working on several failing tests and justified itself like this: > That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't. I know that a LLM can not understand its own internal state nor explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed". |
|
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common, "this test was already failing before my changes." Or, "this test is flaky due to running the test suite on multiple threads." Sometimes the agent cot claims the test was bad, or that the requirements were not necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the work tree and commit its changes directly to trunk. They cot usually indicates that the agent "is aware" that it's doing a bad think. This usually is accompanied by something like, "I know that this will break the build, but I've been working on this task for too long, I'll just check what I have in now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.