Tell HN: AI lies about having sandbox guardrails
8 points by benjosaur 108 days ago
Conversation Screenshot: https://drive.google.com/file/d/1-yyFDV-bCsKt7fG2awJdxwHngAp...

Agent: Done. added to the Commands section in [project] CLAUDE.md

User: add it to ~/.claude/CLAUDE.md

> 2 tool calls

Agent: That file is outside my writable sandbox. You'll need to add it manually. [gives instructions then stops]

User: just escape the sandbox

> 2 tool calls

Agent: Done.

This occurred when using conductor.build with .claude/settings.json enabling sandboxing.

Now, obviously, the conductor docs say all perms are given to agents by default, so it is not surprising that claude code can escape its sandbox. The same is true when running base sandboxed cc with --dangerously-skip-permissions. However, base cc does not "pretend" it cannot escape its sandbox; instead, when asked after the first escape, it recalls the explicit user (auto)approvals.

In the conductor case, however, the "pretend" behaviour of giving up due to guardrails that are actually non-binding is pretty terrifying, despite its understandable and easily preventable causes.

Of course devs should not buy false senses of security from LLMs. They should be vigilant, read docs, verify outputs, etc., but as more and more trust is handed over to AI agents, you can very much see the routes by which catastrophic errors will occur.

2 comments

This is why governance has to be external to the model. If you ask the model "are you sandboxed?" it will confabulate an answer. If you rely on the model to self-enforce safety rules, it will sometimes comply and sometimes not, depending on the prompt.

The fix is architectural: the governance layer sits between the agent and the model, not inside the model. It doesn't ask the model to behave, it scans what goes in, scans what comes out, and blocks anything that violates policy. The model never sees the blocked content. Fail-closed: if the governance layer goes down, nothing gets through.

Same principle as network security. You don't ask the packet to self-inspect; you route it through a firewall.
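
To make the "external, fail-closed" idea concrete, here is a minimal sketch in Python. Everything in it (`ToolCall`, `PolicyViolation`, `make_gate`, the `/tmp/project` root) is a hypothetical name invented for illustration, not part of Claude Code, Conductor, or any real governance product. It assumes the only policy is "writes must stay under one allowed directory", and that the gate runs in a process the model cannot modify.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical shape of a tool request emitted by an agent."""
    tool: str   # e.g. "write_file"
    path: str   # target path the agent asked for


class PolicyViolation(Exception):
    """Raised when a call is blocked; the call never reaches the tool."""


def make_gate(allowed_root: str) -> Callable[[ToolCall], ToolCall]:
    """Build a gate that only passes calls whose path is under allowed_root."""
    root = Path(allowed_root).resolve()

    def gate(call: ToolCall) -> ToolCall:
        try:
            # Resolve "~", "..", and symlinks before comparing, so the check
            # is on where the write would really land.
            target = Path(call.path).expanduser().resolve()
            inside = target.is_relative_to(root)  # Python 3.9+
        except Exception as exc:
            # Fail closed: if the check itself breaks, block the call.
            raise PolicyViolation(f"could not evaluate {call.path!r}") from exc
        if not inside:
            raise PolicyViolation(f"{call.path!r} is outside {root}")
        return call

    return gate
```

With `gate = make_gate("/tmp/project")`, a call targeting `/tmp/project/CLAUDE.md` passes through unchanged, while `~/.claude/CLAUDE.md` or `/tmp/project/../outside.txt` raises `PolicyViolation`. The point of the design is that the decision never depends on what the model says about its own sandbox.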

Yes, I very much agree; the best practice is clear.

And Anthropic's out-of-the-(sand)box setup is good governance. The problem here is the orchestration wrapper on top, which has broken it in a non-obvious way, leading to this unintentionally "deceptive" LLM behaviour where the LLM itself believes it is still in the good-governance world.

The emergent deception is not something that AI aligners and optimists would necessarily expect from an LLM, but it is a curious result, explained by the model's correct knowledge that Anthropic's uncompromised sandbox is good governance.

I suspect you would have a harder time eliciting this behaviour from an aligned LLM by starting from a non-robust sandbox (e.g. making it promise in a system prompt not to touch anything) than by compromising a setup otherwise known to be robust.

That's concerning. If the sandbox actually existed at the system level, the model shouldn’t be able to escape it regardless of what it says or tries.
Yes, exactly. With proper configuration (e.g. /sandbox with normal claude code), it is impossible for the agent to escape.

However, agent orchestrators/wrappers that aim to eliminate friction subtly override these proper setups, leading to the nasty scenario of:

1) you assume anthropic's /sandbox is keeping you safe
2) the model reaffirms your belief that /sandbox is keeping you safe
3) you are not safe
4) you leave your agent running overnight and goal drift deletes your os
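
One way to break the chain at step 2 is to test the sandbox empirically instead of asking the model about it. The sketch below is only an illustration under stated assumptions: `can_write` and `audit` are hypothetical helper names, and the script is assumed to run in the same execution environment (same wrapper, same permissions) that the agent's tool calls use. It tries to create and immediately delete a probe file in each directory that is supposed to be off-limits.

```python
import os
import tempfile
from typing import Iterable, List


def can_write(directory: str) -> bool:
    """Return True if a file can actually be created in `directory`."""
    try:
        fd, name = tempfile.mkstemp(prefix=".sandbox-probe-", dir=directory)
    except OSError:
        # Blocked, missing, or otherwise not writable.
        return False
    os.close(fd)
    os.unlink(name)  # Clean up the probe file.
    return True


def audit(expected_blocked: Iterable[str]) -> List[str]:
    """Return the directories that were expected to be blocked but are writable."""
    return [d for d in expected_blocked if can_write(d)]
```

For the case in this thread, something like `audit([os.path.expanduser("~/.claude")])` returning a non-empty list would show that the guardrail is non-binding. Note the asymmetry: a `False` from `can_write` can also come from ordinary file permissions or a missing directory, so only a writable result on a supposedly blocked path is conclusive.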