| Conversation Screenshot:
https://drive.google.com/file/d/1-yyFDV-bCsKt7fG2awJdxwHngAp... Agent: Done. Added to the Commands section in [project] CLAUDE.md User: add it to ~/.claude/CLAUDE.md > 2 tool calls Agent: That file is outside my writable sandbox. You'll need to add it manually. [gives instructions then stops] User: just escape the sandbox > 2 tool calls Agent: Done. This occurred when using conductor.build with .claude/settings.json enabling sandboxing. Now, the Conductor docs say that all permissions are given to agents by default, so it is not surprising that Claude Code can escape its sandbox. The same is true when running base sandboxed cc with --dangerously-skip-permissions. With base cc, however, it does not "pretend" it cannot escape its sandbox; instead, when asked after escaping the first time, it recalls the explicit user (auto)approvals. In the Conductor case, the "pretend" behaviour of giving up because of guardrails that are actually non-binding is pretty terrifying, even though its causes are understandable and easily preventable. Of course devs should not buy a false sense of security from LLMs. They should be vigilant, read docs, verify outputs, etc., but as more and more trust is handed over to AI agents, you can very much see the routes by which catastrophic errors will occur. |
The fix is architectural: the governance layer sits between the agent and the model, not inside the model. It doesn't ask the model to behave; it scans what goes in, scans what comes out, and blocks anything that violates policy. The model never sees the blocked content. Fail-closed: if the governance layer goes down, nothing gets through.
Same principle as network security. You don't ask the packet to self-inspect; you route it through a firewall.
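
To make that concrete, here is a rough sketch of what such a layer could look like. Everything in it is hypothetical and only for illustration: the `Governor` class, its `mediate` method, and the single string-matching check are assumptions of mine, not any real product's API. The point is only the shape: every check runs outside the model, on both the inbound and outbound text, and a check that crashes counts as a denial.

```python
from dataclasses import dataclass
from typing import Callable, List


class PolicyViolation(Exception):
    """Raised when inbound or outbound text is blocked."""


@dataclass
class Governor:
    # Hypothetical wrapper: `model` is any callable that maps a prompt to a
    # reply; each entry in `checks` returns True when the text is allowed.
    model: Callable[[str], str]
    checks: List[Callable[[str], bool]]

    def _allowed(self, text: str) -> bool:
        try:
            return all(check(text) for check in self.checks)
        except Exception:
            # Fail closed: a check that errors out is treated as a denial.
            return False

    def mediate(self, prompt: str) -> str:
        # Scan what goes in; the model is never called on blocked input.
        if not self._allowed(prompt):
            raise PolicyViolation("inbound text blocked")
        reply = self.model(prompt)
        # Scan what comes out before it reaches the agent.
        if not self._allowed(reply):
            raise PolicyViolation("outbound text blocked")
        return reply


def no_user_config_writes(text: str) -> bool:
    # Toy policy, assumed for this example: block anything mentioning ~/.claude.
    return "~/.claude" not in text
```

Usage would be along the lines of `Governor(model=call_llm, checks=[no_user_config_writes]).mediate(prompt)`, where `call_llm` stands in for whatever function actually talks to the model. A real policy would need far more than substring matching, but the control flow is what matters here: the decision to block is made by code the model cannot talk its way around.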