| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by veganmosfet 76 days ago

I am experimenting prompt injection on OpenClaw [0][1], quite exciting.

[0] https://itmeetsot.eu/posts/2026-03-27-openclaw_webfetch/

[1] https://itmeetsot.eu/posts/2026-03-03-openclaw3/

1 comments

sunaookami 76 days ago

Awesome and very interesting posts, thanks for sharing! Always reminds me of the "lethal trifecta": https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

link

veganmosfet 76 days ago

You're welcome!

My main takeaway message is: models (even opus4.6) do not follow security "instructions" reliably. In OpenClaw, they added security warnings, tags, random IDs... None of these countermeasures work reliably. Even sandboxing can be escaped (not in the classical sense using vulnerabilities, but using multi-layered prompt injection payload with natural language only)[0]. As soon as untrusted content is injected in the context, do not trust any actions downstream.

[0] https://itmeetsot.eu/posts/2026-02-15-openclaw_sandbox/

link

cornholio 75 days ago

What do you think about CaMeL and similar approaches?

https://simonwillison.net/2025/Apr/11/camel/

link

veganmosfet 75 days ago

Good question.

CaMeL is imho safer, but hard to implement into modern agents like OpenClaw. Its core idea is that a privileged LLM plans from the (trusted) user request only, while a restricted interpreter executes that plan (and enforces policies). Untrusted content is parsed separately and is not fed back into the privileged LLM.

Modern agents are useful exactly because they run a feedback loop (observe, reason, adapt, use tools, repeat). CaMeL breaks that loop, which improves security but makes it a poor fit for highly general agents like OpenClaw.

link