| I wonder if a lot of those "injection" problems could be overcome by distinguishing the different types of input and output at the token level. E.g. imagine that every token an LLM inputs or outputs is associated with a "color" or "channel" that corresponds to the token's source or destination:
- "red": tokens input by the user, i.e. the initial prompt and subsequent replies.
- "green": answers from the LLM to the user, i.e. everything the user sees as textual output on the screen.
- "blue": instructions from the LLM to a plugin: database queries, calculations, web requests, etc.
- "yellow": replies from the plugin back to the LLM.
- "purple": the initial system prompt.
The point is that each (word, color) combination constitutes a separate token; i.e. if your "root" token dictionary was as follows:
    hello -> 0001;
    world -> 0002;
then the "colorized" token dictionary would be the cross product of the root dictionary and the set of colors:
    hello (red) -> 0001;
    hello (green) -> 0002;
    ...
    world (red) -> 0006;
    world (green) -> 0007;
    ...
Because the model considers "hello (red)" and "hello (blue)" two different tokens, it also has two different sets of weights for those tokens, and hopefully much less risk of confusing one kind of token with the other.
With some luck, you don't have to use 5x the compute and training data for training: you might be able to take an "ordinary" model trained on non-colored tokens, copy the weights four times (one copy per additional color), and finetune the resulting "expanded" model on a colored corpus.
Likewise, because the model should only ever predict "green" or "blue" tokens, any output neuron that corresponds only to "red", "yellow" or "purple" tokens can be removed from the model. |
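To make the quoted scheme a bit more concrete, here is a rough Python sketch of the colorized dictionary, the "only green or blue may be predicted" restriction, and the weight-copying step. Everything in it is an assumption for illustration: the names `ROOT_VOCAB`, `COLORS`, `colorize_vocab`, `encode`, `allowed_output_ids` and `expand_embeddings` are hypothetical, the whitespace "tokenizer" is a toy, and embeddings are plain lists rather than real model weights. It says nothing about whether the approach holds up against adversarial input.

```python
from itertools import product

# Hypothetical root dictionary and channel names, taken from the comment above.
ROOT_VOCAB = {"hello": 1, "world": 2}
COLORS = ["red", "green", "blue", "yellow", "purple"]
OUTPUT_COLORS = {"green", "blue"}  # the only channels the model may emit


def colorize_vocab(root_vocab, colors):
    """Build the (word, color) -> id cross product, with ids starting at 1."""
    words = sorted(root_vocab, key=root_vocab.get)  # keep root id order
    return {
        (word, color): index
        for index, (word, color) in enumerate(product(words, colors), start=1)
    }


def encode(text, color, vocab):
    """Toy tokenizer: split on whitespace and tag every word with one color."""
    return [vocab[(word, color)] for word in text.split()]


def allowed_output_ids(vocab, output_colors):
    """Ids the output layer would keep; all other output neurons are dropped."""
    return sorted(i for (_, color), i in vocab.items() if color in output_colors)


def expand_embeddings(root_rows, n_colors):
    """Copy each root embedding row once per color as a finetuning start point.

    Row order matches colorize_vocab: row (id - 1) belongs to token id.
    """
    return [list(row) for row in root_rows for _ in range(n_colors)]


COLORED_VOCAB = colorize_vocab(ROOT_VOCAB, COLORS)
user_ids = encode("hello world", "red", COLORED_VOCAB)
model_ids = encode("hello world", "green", COLORED_VOCAB)
```

With this layout, the same words typed by the user and produced by the model map to disjoint id sequences, so text arriving on the "red" or "yellow" channel can never be token-identical to the "purple" system prompt. Whether a finetuned model actually learns to treat the channels differently is a separate, empirical question.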
I've also elaborated on why this problem is harder than one might think in a blog post: https://medium.com/better-programming/the-dark-side-of-llms-...
It's easy to come up with solutions that seem promising, but so far no one has produced a solution that holds up to adversarial pressure. And indirect prompt injection on integrated LLMs increases the stakes significantly.