| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by simonw 342 days ago

There's a crucial difference here.

When you were working as a pentester, how often did you find a security hole and report it and the response was "it is impossible for us to fix that hole"?

If you find an XSS or a SQL injection, that means someone made a mistake and the mistake can be fixed. That's not the case for prompt injections.

My favorite paper on prompt injection remedies is this one: https://arxiv.org/abs/2506.08837

Two quotes from that paper:

> once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions—that is, actions with negative side effects on the system or its environment.

The paper also mentions how detection systems "cannot guarantee prevention of all attacks":

> Input/output detection systems and filters aim to identify potential attacks (ProtectAI.com, 2024) by analyzing prompts and responses. These approaches often rely on heuristic, AI-based mechanisms — including other LLMs — to detect prompt injection attempts or their effects. In practice, they raise the bar for attackers, who must now deceive both the agent’s primary LLM and the detection system. However, these defenses remain fundamentally heuristic and cannot guarantee prevention of all attacks.

1 comments

jstummbillig 342 days ago

How would you say this compares to human error? Let's say instead of the LLM there's a human that can be fooled into running an unsafe query and returning data. Is there anything fundamentally different there, that makes it less of a problem?

link

simonw 342 days ago

You can train the human not to fall for this, and discipline, demote or even fire them if they make that mistake.

link