| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kouteiheika 77 days ago

There is one way to practically guarantee than no prompt injection is possible, but it's somewhat situational - by finetuning the model on your specific, single task.

For example, let's say you want to use an LLM for machine translation from English into Klingon. Normally people just write something like "Translate the following into Klingon: $USER_PROMPT" using a general purpose LLM, and that is vulnerable to prompt injection. But, if you finetune a model on this well enough (ideally by injecting a new special single token into its tokenizer, training with that, and then just prepending that token to your queries instead of a human-written prompt) it will become impossible to do prompt injection on it, at the cost of degrading its general-purpose capabilities. (I've done this before myself, and it works.)

The cause of prompt injection is due to the models themselves being general purpose - you can prompt it with essentially any query and it will respond in a reasonable manner. In other words: the instructions you give to the model and the input data are part of the same prompt, so the model can confuse the input data as being part of its instructions. But if you instead fine-tune the instructions into the model and only prompt it with the input data (i.e. the prompt then never actually tells the model what to do) then it becomes pretty much impossible to tell it to do something else, no matter what you inject into its prompt.

4 comments

calpaterson 77 days ago

I thought about mentioning fine-tuning. Obviously as you say there are some costs (the re-training) and then also you lose the general purpose element of it.

But I am still unsure that it actually is robust. I feel like you're still vulnerable to Disregard That in that you may find that the model just starts to ignore your instruction in favour of stuff inside the context window.

An example where OpenAI have this problem: they ultimately train in a certain content policy. But people quite often bully or trick chat.openai.com into saying things that go against that content policy. For example they say "it's hypothetical" or "just for a thought experiment" and you can see the principle there, I hope. Training-in your preferences doesn't seem robust in the general sense.

link

martijnvds 77 days ago

Wouldn't that leave ways to do "phone phreaking" style attacks, because it's an in-band signal?

link

kouteiheika 77 days ago

In theory you still use the same blob (i.e. the prompt) to tell the model what to do, but practically it pretty much stops becoming an in-band signal, so no.

As I said, the best way to do this is to inject a brand new special token into the model's tokenizer (one unique token per task), and then prepend that single token to whatever input data you want the model to process (and make sure the token itself can't be injected, which is trivial to do). This conditions the model to look only at your special token to figure out what it should do (i.e. it stops being a general instruction following model), and only look at the rest of the prompt to figure out the inputs to the query.

This is, of course, very situational, because often people do want their model to still be general-purpose and be able to follow any arbitrary instructions.

link

zahlman 77 days ago

> and make sure the token itself can't be injected, which is trivial to do

Are they actually doing this? The stuff that Anthropic has been saying about the deliberate use of XML-style markup makes me wonder a bit.

link

kouteiheika 77 days ago

> Are they actually doing this? The stuff that Anthropic has been saying about the deliberate use of XML-style markup makes me wonder a bit.

Yes.

The XML-style markup are not special tokens, and are usually not even single-token; usually special tokens are e.g. `<|im_start|>` which are internally used in the chat template, but when fine-tuning a model you can define your own, and then just use them internally in your app but have the tokenizer ignore them when they're part of the untrusted input given to the model. (So it's impossible to inject them externally.)

link

nick49488171 77 days ago

Eventually we will rediscover the Harvard Architecture for LLMs.

link

BoorishBears 77 days ago

This doesn't work for the tasks people are worried about because they want to lean on the generalization of the model + tool calling.

What you're describing is also already mostly achieved by using constrained decoding: if the injection would work under constrained decoding, it'll usually still work even if you SFT heavily on a single task + output format

link

the8472 77 days ago

A Klingon, doing his best to quote the original text in Federation Standard (English): "..."

link