Hacker News new | ask | show | jobs
by sunaookami 76 days ago
Awesome and very interesting posts, thanks for sharing! Always reminds me of the "lethal trifecta": https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
1 comments

You're welcome!

My main takeaway message is: models (even opus4.6) do not follow security "instructions" reliably. In OpenClaw, they added security warnings, tags, random IDs... None of these countermeasures work reliably. Even sandboxing can be escaped (not in the classical sense using vulnerabilities, but using multi-layered prompt injection payload with natural language only)[0]. As soon as untrusted content is injected in the context, do not trust any actions downstream.

[0] https://itmeetsot.eu/posts/2026-02-15-openclaw_sandbox/

What do you think about CaMeL and similar approaches?

https://simonwillison.net/2025/Apr/11/camel/

Good question.

CaMeL is imho safer, but hard to implement into modern agents like OpenClaw. Its core idea is that a privileged LLM plans from the (trusted) user request only, while a restricted interpreter executes that plan (and enforces policies). Untrusted content is parsed separately and is not fed back into the privileged LLM.

Modern agents are useful exactly because they run a feedback loop (observe, reason, adapt, use tools, repeat). CaMeL breaks that loop, which improves security but makes it a poor fit for highly general agents like OpenClaw.