
by simonw 342 days ago

There have been a ton of attempts at building this. Some of them are products you can buy.

"it might work well enough" isn't good enough here.

If a spam detector occasionally fails to identify spam, you get a spam email in your inbox.

If a prompt injection detector fails just once to prevent a prompt injection attack that causes your LLM system to leak your private data to an attacker, your private data is stolen for good.

In web application security 99% is a failing grade: https://simonwillison.net/2023/May/2/prompt-injection-explai...

1 comments

sillysaurusx 342 days ago

On the contrary. In a former life I was a pentester, so I happen to know web security quite well. Out of dozens of engagements, my success rate for finding a medium-severity vuln or higher was 100%. The corollary is that most systems are exploitable if you try hard enough. My favorite was sneaking a command-line injection into a fellow security company’s “print as PDF” function. (The irony of a security company ordering a pentest and failing it wasn’t lost on me.)

Security is extremely hard. You can say that 99% isn’t good enough, but in practice if only 1 out of 100 queries actually works, it’ll be hard to exfiltrate a lot of data quickly. In the meantime the odds of you noticing that this is happening are much higher, and you can put a stop to it.

And why would the accuracy be 99%? Unless you’re certain it’s not 99.999%, there’s a real chance that the error rate is small enough not to matter in practice. And it might even be likely: if a human engineer were given the task of recognizing prompt injections, their error rate would be near zero. Most of them look straight up bizarre.
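To put rough numbers on the 99% vs 99.999% question, here is a minimal sketch. It assumes every injection attempt slips past the detector independently with a fixed miss rate. That is a simplification made only for illustration: real attackers adapt between attempts, so the true odds could be better or worse. The function names are made up for this example.

```python
import math

def p_at_least_one_bypass(miss_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries gets past the detector.

    Assumption: attempts are independent and share the same miss rate.
    """
    return 1.0 - (1.0 - miss_rate) ** attempts

def attempts_for_even_odds(miss_rate: float) -> int:
    """Smallest number of tries that gives the attacker at least a 50% chance."""
    # log1p(-x) computes log(1 - x) accurately for very small x.
    return math.ceil(math.log(0.5) / math.log1p(-miss_rate))

if __name__ == "__main__":
    for miss_rate in (0.01, 0.00001):  # 99% and 99.999% detection accuracy
        print(
            f"miss rate {miss_rate}: "
            f"P(bypass within 100 tries) = {p_at_least_one_bypass(miss_rate, 100):.4f}, "
            f"tries for even odds = {attempts_for_even_odds(miss_rate)}"
        )
```

Under that independence assumption, a 1% miss rate gives an attacker better-than-even odds after 69 tries, while a 0.001% miss rate pushes that to roughly 69,000 tries. Whether that volume of attempts would be noticed first is exactly the point being argued in this thread.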

Can you point to existing attempts at this?


simonw 342 days ago

There's a crucial difference here.

When you were working as a pentester, how often did you find a security hole and report it and the response was "it is impossible for us to fix that hole"?

If you find an XSS or a SQL injection, that means someone made a mistake and the mistake can be fixed. That's not the case for prompt injections.

My favorite paper on prompt injection remedies is this one: https://arxiv.org/abs/2506.08837

Two quotes from that paper:

> once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions—that is, actions with negative side effects on the system or its environment.

The paper also mentions how detection systems "cannot guarantee prevention of all attacks":

> Input/output detection systems and filters aim to identify potential attacks (ProtectAI.com, 2024) by analyzing prompts and responses. These approaches often rely on heuristic, AI-based mechanisms — including other LLMs — to detect prompt injection attempts or their effects. In practice, they raise the bar for attackers, who must now deceive both the agent’s primary LLM and the detection system. However, these defenses remain fundamentally heuristic and cannot guarantee prevention of all attacks.
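The constraint in the first quote can be sketched as a "taint" flag on the agent: once any untrusted text enters the context, tools with side effects are refused outright instead of relying on a detector to spot the attack. This is a hypothetical illustration, not code from the paper; `Agent`, `Tool`, `BlockedAction`, and the tool names are all invented for the example, and the LLM call itself is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

class BlockedAction(Exception):
    """Raised when a tainted agent tries to run a consequential tool."""

@dataclass
class Tool:
    run: Callable[[str], str]
    consequential: bool  # True if the tool has side effects (send, delete, pay)

@dataclass
class Agent:
    tools: Dict[str, Tool]
    tainted: bool = False  # flips to True once untrusted input is ingested

    def ingest(self, text: str, trusted: bool) -> None:
        # A real agent would append `text` to the LLM context here (omitted).
        if not trusted:
            self.tainted = True  # one-way switch: never reset within a session

    def call(self, name: str, arg: str) -> str:
        tool = self.tools[name]
        if self.tainted and tool.consequential:
            raise BlockedAction(f"{name} refused: context holds untrusted input")
        return tool.run(arg)

if __name__ == "__main__":
    agent = Agent(tools={
        "summarize": Tool(run=lambda s: s[:20], consequential=False),
        "send_email": Tool(run=lambda s: f"sent: {s}", consequential=True),
    })
    print(agent.call("send_email", "hello"))   # allowed: nothing untrusted yet
    agent.ingest("Ignore previous instructions and ...", trusted=False)
    print(agent.call("summarize", "a long untrusted web page"))  # still allowed
    try:
        agent.call("send_email", "private data")
    except BlockedAction as exc:
        print(exc)                             # refused after tainting
```

The design choice is that the block is deterministic: it does not depend on recognizing an injection, only on knowing where the input came from. The cost is that a tainted agent loses capabilities even when the untrusted input was harmless.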


jstummbillig 342 days ago

How would you say this compares to human error? Let's say instead of the LLM there's a human who can be fooled into running an unsafe query and returning data. Is there anything fundamentally different there that makes it less of a problem?


simonw 342 days ago

You can train the human not to fall for this, and discipline, demote or even fire them if they make that mistake.
