Hacker News new | ask | show | jobs
by Ari_Rahikkala 1131 days ago
Call me an optimist, but I think prompt injection just isn't as fundamental a problem as it seems.

Having a single, flat text input sequence with everything in-band isn't fundamental to transformer: The architecture readily admits messing around with different inputs (with, if you like, explicit input features to make it simple for the model to pick up which ones you want to be special), position encodings, attention masks, etc.. The hard part is training the model to do what you want, and it's LLM training where the assumption of a flat text sequence comes from.

The optimistic view is, steerability turns out not to be too difficult: You give the model a separate system prompt, marked somehow so that it's easy for the model to pick up that it's separate from the user prompt; and it turns out that the model takes well to your steerability training, i.e. following the instructions in the system prompt above the user prompt. Then users simply won't be able to confuse the model with delimiter injection: OpenAI just isn't limited to in-band signaling.

The pessimistic view is, the way that the model generalizes its steerability training will have lots of holes in it, and we'll be stuck with all sorts of crazy adversarial inputs that can confuse the model into following instructions in the user prompt above the system prompt. Hopefully those attacks will at least be more exciting than just messing with delimiters.

(And I guess the depressing view is, people will build systems on top of ChatGPT with no access to the system prompt in the first place, and we will in fact be stuck with the problem)

4 comments

My article has an example that doesn't involve messing with delimiters already.

I'm currently a pessimist about this because prompt injection has been a problem for six months now and no-one has yet come up with a convincing solution, despite the very real economic incentives to find one.

One problem is the big disbalance in resource requirements for pretraining large foundational models and finetuning them for specific tasks.

Currently, the foundational models have no concept of "prompt", that's only added in later finetuning, and by that stage it is too late to mess around with different architectural features to implement out-of-band signaling, as the architecture is fixed. If we'd want it to learn to handle out-of-band data, then we'd need to figure out how to handle that during the initial unsupervised pretraining on unlabeled text, otherwise it will simply learn to ignore all those prompt-related features.

I’m inclined to believe that it’s not a fundamental problem. But it’s unclear what kind of tradeoffs architectures that aren’t vulnerable will have, and I suspect there might be many false starts in trying to solve the problem.

Edit: To clarify, I think the problem is solvable in theory but may limit the effectiveness of these models in practice. My biggest concern is that people will gloss over these concerns and deploy vulnerable systems.

Maybe there will be multiple authority levels where OpenAI's built in prompts are the most powerful, then the api user's, then the end user's.