That's precisely why I am using a different analogy when talking about this. The SQL injection analogy only matches the injection part, not the rest. There is nothing to secure, because there is no SQL query. You want the agent to work on data, in a "general" way, otherwise you'd just use a script.
The better analogy is phishing. Because that's what's happening here. The "prompt injection" attack is trying to "phish" the LLM into doing something unintended. That's how we should all comunicate it, as it matches better with what's happening. Unfortunately there aren't really good defences for it, as we all know from phishing "education" / "campaigns". Your best bet is to secure it in layers, try to have warnings (i.e. classification models) you try to secure the next step (i.e. capabilities based tool execution) and so on. But it's not foolproof and it should be communicated clearly.
Why not write some wrapper code so you can basically hand the LLM placeholders for data it never gets to see? Whenever it uses the placeholder in the response, you replace it with the real data (via real code, not by telling an LLM to "do that").
Surely this has been tried? If so, what makes it not work, or work badly? I'm honestly curious.
Fundamentally, an LLM is a list of N tokens that generates N+1 tokens. In other words, it's just a wall of text (aka context window). There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those" except by putting words into the context window. So the placeholders and the instructions both coexist in the context window, and one can override the other.
In other words, if you have placeholders for data, those placeholders are eventually filled in with real data, and all of it goes into the context window at once. There's no way for the LLM to be told "this is a data placeholder," because the entire conversation is data.
Reinforcement learning mitigates this somewhat, by training the model to prefer the system prompt over user prompts. But (a) there's only one context window that both prompts share, and (b) this is a probabilistic guard; it's not the same thing as writing a traditional program that's guaranteed to separate code and data with hardware safeguards. Such a thing isn't possible with LLMs.
Probabilistic safeguards can work, but they'll need to get the incident rate down to, say, 1 in a million or less. I haven't paid attention, but the current rates seem to be a lot higher, given the pretty universal experience of "wow, that prompt injection actually worked."
> There's no way to tell it "tokens 124 through 200 are dangerous, please disregard those"
Hence "real code"
You have some markup for secret start/end. Instead of passing the input directly to the LLM, you parse it first, take anything within "secret/dangerous tags" and store it, generate a key for it and put that key where the secret was, then you pass it on to the LLM. Let's say the work of the LLM is "give me (not "make") the POST request to make the bank transaction", you get a response, replace the keys with the secrets in the response, and make the POST request.
I'm sure there's a million interesting ways this could fail or be useless [0], but passing user input or a secret to the LLM would never, ever happen.
[0] if LLM suck at math, they may suck at reproducing lots of long hashes 100% correctly, too? I have no idea
That would work for generating POST requests. But AI is used to solve messy, non-deterministic problems. Usually the step after “give me the X” is to feed X back into the model, because it has to; if X is even slightly nondeterministic then an AI model has to analyze it. That’s where prompt injections happen.
This was my big mistake when I coined the term "prompt injection". I named it after SQL injection because the cause is the same - concatenating together trusted and untrusted text.
What I didn't realize at the time is that the fix is NOT the same. SQL injection is fixed by parameterizing queries. I assumed the same technique could be used with prompts... and then quickly learned that this isn't true at all. We've been trying to figure out how to do that for nearly four years now without success.
So "prompt injection" is a bad name, because it implies a solution which doesn't actually work.
(Not to mention it turns out many people are unaware of SQL injection so when they hear "prompt injection" they assume it means injecting bad prompts into a model, aka jailbreaking.)
I thought the whole value proposition of this thing was supposed to be that the interface is "natural" human language. If interact with it using a structured and specified language... then what are we doing exactly? Is this AI? Maybe we just re-invented GraphQL or something?
I see far more SVG injections than SQL injections these days, but YYMV. My programming ecosystem has very robusy SQL libraries, from simple prepared statement bindings to complex ORMs and everything in between.
I've seen it quite a lot in my career: even when prepared statements are available and easy to use from a SQL client library, many programmers will simply not use them, in favor of format strings and string concatenation (maybe with an attempt to quote/escape user input).
Just having support for the right way isn't enough. You have to put up roadblocks when people try to go the wrong way.
Why is a format string or string concatenation (or interpolation, what I would use) the “wrong way” when all user input (more precisely: all string literals) are properly escaped?
The main reason is that a lot of the reason comes around that it is incredibly difficult to do this in a general case just because of the grammar of SQL. Especially with the very different dialects, in the worst case you can get unintended remote code execution[1]
There's an incidental performance benefit on some database engines as well. When you write a SQL query, in general the database engine has to compile this to a form it can use
If you use raw string concatenation, "SELECT USERS FROM table WHERE id=1" might compile to something like (pseudocode below)
def prepstatement1():
...
So if you use an explicit prepared statement[1], something like "SELECT USERS FROM table WHERE id=?" might compile to something like
def prepstatement2(id: int): # <--- notice the new parameter here
...
Some database engines also have the ability to cache a prepared statement and so these are a lil bit faster. Remember, your database has to still compile the string concatenated case, it's just a little bit hidden.
The better analogy is phishing. Because that's what's happening here. The "prompt injection" attack is trying to "phish" the LLM into doing something unintended. That's how we should all comunicate it, as it matches better with what's happening. Unfortunately there aren't really good defences for it, as we all know from phishing "education" / "campaigns". Your best bet is to secure it in layers, try to have warnings (i.e. classification models) you try to secure the next step (i.e. capabilities based tool execution) and so on. But it's not foolproof and it should be communicated clearly.