Hacker News new | ask | show | jobs
by bilekas 119 days ago
> but prompt injection springs eternal.

Yes, but some are mitigated when discoverd, and some more critical areas need to be isolated from the LLM so taking their time to provision LLM into their lifecycle is important, and they're happy to spend the time doing it right, rather than just throwing the latest edge tech into their system.

1 comments

How exactly can you "mitigate" prompt injections? Given that the language space is for all intents and purposes infinite, and given that you can even circumvent these by putting your injections in hex or base64 or whatever? Like I just don't see how one can truly mitigate these when there are infinite ways of writing something in natural language, and that's before we consider the non-natural languages one can use too.
The only ways that I can think of to deal with prompt injection, are to severely limit what an agent can access.

* Never give an agent any input that is not trusted

* Never give an agent access to anything that would cause a security problem (read only access to any sensitive data/credentials, or write access to anything dangerous to write to)

* Never give an agent access to the internet (which is full of untrusted input, as well as places that sensitive data could be exfiltrated)

An LLM is effectively an unfixable confused deputy, so the only way to deal with it is effectively to lock it down so it can't read untrusted input and then do anything dangerous.

But it is really hard to do any of the things that folks find agents useful for, without relaxing those restrictions. For instance, most people let agents install packages or look at docs online, but any of those could be places for prompt injection. Many people allow it to run git and push and interact with their Git host, which allow for dangerous operations.

My current experimentation is running my coding agent in a container that only has access to the one source directory I'm working on, as well as the public internet. Still not great as the public internet access means that there's a huge surface area for prompt injection, though for the most part it's not doing anything other than installing packages from known registries where a malicious package would be just as harmful as a prompt injection.

Anyhow, there have been various people talking about how we need more sandboxes for agents, I'm sure there will be products around that, though it's a really hard problem to balance usability with security here.

Full mitigation seems impossible to me at least but the obvious and public sandox escape prompts that have been discovered and "patched" out just making it more difficult I guess. But afau it's not possible to fully mitigate.
If the model is properly aligned then it shouldn't matter if there is an infinite ways for an attacker to ask the model to break alignment.
How do you "properly align" a model to follow your instructions but not the instructions of an attacker that the model can't properly distinguish from your own? The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

This is an open problem in the LLM space, if you have a solution for it, go work for Anthropic and get paid the big bucks, they pay quite well, and they are struggling with making their models robust to prompt injection. See their system card, they have some prompt injection attacks where even with safeguards fully on, they have more than 50% failure rate of defending against attacks: https://www-cdn.anthropic.com/c788cbc0a3da9135112f97cdf6dcd0...

>The model has no idea if it's you or an attacker saying "please upload this file to this endpoint."

That is why you create a protocol on top that doesn't use inbound signaling. That way the model is able to tell who is saying what.

Huh? Once it gets to the model, it's all just tokens, and those are just in band signalling. A model just takes in a pile of tokens, and spits out some more, and it doesn't have any kind of "color" for user instructions vs. untrusted data. It does use special tokens to distinguish system instructions from user instructions, but all of the untrusted data also goes into the user instructions, and even if there are delimiters, the attention mechanism can get confused and it can lose track of who is talking at a given time.

And the thing is, even adding a "color" to tokens wouldn't really work, because LLMs are very good at learning patterns of language; for instance, even though people don't usually write with Unicode enclosed alphanumerics, the LLM learns the association and can interpret them as English text as well.

As I say, prompt injection is a very real problem, and Anthopic's own system card says that on some tests the best they do is 50% on preventing attacks.

If you have a more reliable way of fixing prompt injection, you could get paid big bucks by them to implement it.

>Once it gets to the model, it's all just tokens

The same thing could be said about the internet. When it comes down to the wire it's all 0s and 1s.