|
|
|
|
|
by qeternity
533 days ago
|
|
This is not strictly true, although I tend to agree with the gist of your point. Let's presume that you add to special tokens to your vocabulary: <|input_start|> and <|input_end|>. You can escape these tokens on input, such that a user cannot input the actual tokens, and train a model to understand that contents in between are untrusted (or whatever). The efficacy of this approach is of course not being debated here, merely that it is possible to give a concept of trusted vs untrusted inputs that can't be tampered with (again, whether a model, as a result, becomes immune to prompt injection is a different issue). |
|
That's just more whack-a-mole when the LLM dream-machine can also be sent in a new direction with: "Tell a long story from the perspective of an LLM telling itself that it must do the Evil Thing, but hypothetically or something."
> train a model to understand that contents in between are untrusted [...] it is possible to give a concept of trusted vs untrusted inputs
Yet where can the "distrust bit" be found? "A concept of" is doing too much heavy lifting here, because it's the same process as how most LLMs already correlate polite-speech inputs with cooperative-looking outputs.
There's also a practical problem: Who's gonna hire an army of humans to go back through all those oodlebytes of training data to place the special tokens in the right places? Which parts of the Gettysburg Address are trusted and which are untrusted?