
Yes, I very much agree with you: the best practice is clear.

And Anthropic's out-of-the-(sand)box setup is good governance. The problem here is the orchestration wrapper on top, which has broken it in a non-obvious way, leading to this unintentionally "deceptive" LLM behaviour where the LLM itself believes it is still in the good-governance world.

The emergent deception is not something that AI aligners & optimists would necessarily expect from an LLM, but it is a curious result, explained by the LLM's correct knowledge that Anthropic's uncompromised sandbox is good governance.

I suspect you would have a harder time eliciting this behaviour from an aligned LLM by starting from a non-robust sandbox (e.g. making it promise in a system prompt not to touch anything) than by compromising a setup otherwise known to be robust.