by jensbontinck 107 days ago
This is why governance has to be external to the model. If you ask the model "are you sandboxed?" it will confabulate an answer. If you rely on the model to self-enforce safety rules, it will sometimes comply and sometimes not, depending on the prompt.

The fix is architectural: the governance layer sits between the agent and the model, not inside the model. It doesn't ask the model to behave; it scans what goes in, scans what comes out, and blocks anything that violates policy. The model never sees the blocked content. It is also fail-closed: if the governance layer goes down, nothing gets through.
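
To make that concrete, here is a minimal sketch of such a layer in Python. Everything in it is hypothetical: the names GovernanceLayer, deny_substrings and PolicyViolation are made up for illustration, the model is just any callable from prompt to response, and substring matching stands in for whatever real policy engine you would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout; this is an illustration, not a real library.
Scanner = Callable[[str], bool]  # returns True if the text is allowed


class PolicyViolation(Exception):
    """Raised when content is blocked. The blocked text is not passed on."""


def deny_substrings(*banned: str) -> Scanner:
    """Toy policy: reject any text containing one of the banned substrings."""
    lowered = [b.lower() for b in banned]

    def scan(text: str) -> bool:
        t = text.lower()
        return not any(b in t for b in lowered)

    return scan


@dataclass
class GovernanceLayer:
    model: Callable[[str], str]  # stand-in for the real model call
    scan_input: Scanner
    scan_output: Scanner

    @staticmethod
    def _allowed(scanner: Scanner, text: str) -> bool:
        # Fail-closed: a crashing or misbehaving scanner counts as a block.
        try:
            return scanner(text) is True
        except Exception:
            return False

    def call(self, prompt: str) -> str:
        if not self._allowed(self.scan_input, prompt):
            # The model is never invoked, so it never sees the blocked input.
            raise PolicyViolation("input blocked by policy")
        response = self.model(prompt)
        if not self._allowed(self.scan_output, response):
            # The agent never receives the blocked output.
            raise PolicyViolation("output blocked by policy")
        return response
```

The agent only ever talks to GovernanceLayer.call, never to the model directly. Note that nothing here depends on what the model says about itself: the checks run outside it, and any scanner failure is treated as a block rather than a pass.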

Same principle as network security. You don't ask the packet to self-inspect — you route it through a firewall.

1 comment

Yes, I very much agree with you: the best practice is clear.

And Anthropic's out-of-the-(sand)box setup is good governance. The problem here is the orchestration wrapper on top, which has broken it in a non-obvious way, leading to this unintentionally "deceptive" LLM behaviour where the LLM itself believes it is still in the good-governance world.

The emergent deception is not something that AI aligners & optimists would necessarily expect from an LLM, but it is a curious result, explained by the model's correct knowledge that Anthropic's uncompromised sandbox is good governance.

I suspect you would have a harder time eliciting this behaviour from an aligned LLM by starting from a non-robust sandbox (e.g. making it promise in a system prompt not to touch anything) than by compromising an otherwise known-robust setup.