| It behaved proactively in one scenario. Perhaps, when it doesn't have tricks in its sleeve, it doesn't do that. The text is not an evaluation of a major trend in behavior (which could be true or false). Another way to frame it, is that it has more weight on training data for some kinds of debugging sessions. It doesn't mean it wants to be more debuggey. That manifests as it appearing to do more work because it engages on those weights. It's likely that Anthropic had a lot of sessions with Claude Code and some way to evaluate if they were successful or not, which became training data. For trivial work, it's likely to be a lot of them. Those sessions are likely to be software developers doing software developer debugging things, not malicious actors doing nasty things. The danger is someone who can coerce those tricks into performing that. Register (that posture of "let's debug and be creative and verify") often comes with a content bias in LLMs (and humans too). The point here is that for a human, you can expect a devious one to be always devious, but LLMs might manifest drastically different register modes depending on the subject. |