
I've seen this happen in these conversations too; a solution gets proposed, it gets bypassed, another solution gets proposed that manages to block the specific prompt, it gets bypassed, another solution gets proposed, and so on. And the eventual claim ends up being, "well, it's not easy but it's clearly possible", even though nothing has actually been demonstrated that shows that it is possible.

To try and shortcut around that whole conversation, let me ask you more directly: are you confident that the prompt you propose here will block literally 100% of attacks? If you think it will, then great, let's test it and see if it's robust. But if you're not confident in that claim, then it's not a working example. Because if 100 people try to use prompt injection to hack your email agent and 1 of them gets through, then you just got hacked. It doesn't matter how many failed.

99% is good enough for something like content moderation. It's not good enough for security.

Chained LLMs are a probabilistic defense. They work well if you need to stop somebody from swearing, because it doesn't matter if 1/100 people manage to get an LLM to swear. They do not work well if you're using an LLM in a security-conscious environment, and that is what severely limits how LLM agents can be used with sensitive APIs.
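
To put rough numbers on why 99% doesn't cut it, here is a minimal sketch of the arithmetic. It assumes each injection attempt is independent and that the filter blocks each one with the same fixed probability, which is a simplification (real attackers adapt, so this is if anything optimistic for the defender). The function name `breach_probability` and the 0.99 block rate are purely illustrative, not measurements of any real filter.

```python
def breach_probability(block_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` injection attempts gets through.

    Assumption (illustrative only): attempts are independent and each one
    is blocked with the same probability `block_rate`.
    """
    if not 0.0 <= block_rate <= 1.0:
        raise ValueError("block_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # P(all attempts blocked) = block_rate ** attempts, so take the complement.
    return 1.0 - block_rate ** attempts


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        p = breach_probability(0.99, n)
        print(f"{n:>5} attempts: {p:.1%} chance of at least one bypass")
```

Under those assumptions, a filter that stops 99% of attempts still has roughly a 63% chance of letting at least one through after 100 tries. Only a block rate of exactly 1.0 keeps that number at zero, which is the whole point.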

---

To head off another argument that I typically see raised: saying "no application is completely secure, everyone has security breaches occasionally" changes nothing about the fundamental difference between probabilistic security and provable security. Applications occasionally have holes that are exploited using novel attacks, but it is possible to secure an interface in such a way that 100% of known attacks will not affect it. At that point, any security holes that remain will be the result of human error or oversight; they won't be inherent to the technology being used.

It is not possible to secure an LLM in that way (at least, no one has demonstrated that it is possible[0]). You're not being asked here to demonstrate that a second LLM can filter some attacks, you're being asked to demonstrate that a second LLM can filter all attacks. So even a theoretically robust filter that filters 99% of attacks is not proof of anything. We're not trying to moderate a Twitch chat, we're trying to secure internal APIs.

Unless you're confident that the prompt you just offered will block literally 100% of malicious prompts, you haven't proven anything. "Hard to break" is insufficient.

---

[0]: I'm exaggerating a little here. Simon Willison has actually written about how to secure an LLM agent (https://simonwillison.net/2023/Apr/25/dual-llm-pattern/), and the proposal seems basically sound to me; I think it would work. But the sandboxing is very limiting and cumbersome, and that proposal is generally not the answer that people want to hear when they ask about LLM security.