|
Call me an optimist, but I think prompt injection just isn't as fundamental a problem as it seems. Having a single, flat text input sequence with everything in-band isn't fundamental to transformer: The architecture readily admits messing around with different inputs (with, if you like, explicit input features to make it simple for the model to pick up which ones you want to be special), position encodings, attention masks, etc.. The hard part is training the model to do what you want, and it's LLM training where the assumption of a flat text sequence comes from. The optimistic view is, steerability turns out not to be too difficult: You give the model a separate system prompt, marked somehow so that it's easy for the model to pick up that it's separate from the user prompt; and it turns out that the model takes well to your steerability training, i.e. following the instructions in the system prompt above the user prompt. Then users simply won't be able to confuse the model with delimiter injection: OpenAI just isn't limited to in-band signaling. The pessimistic view is, the way that the model generalizes its steerability training will have lots of holes in it, and we'll be stuck with all sorts of crazy adversarial inputs that can confuse the model into following instructions in the user prompt above the system prompt. Hopefully those attacks will at least be more exciting than just messing with delimiters. (And I guess the depressing view is, people will build systems on top of ChatGPT with no access to the system prompt in the first place, and we will in fact be stuck with the problem) |
I'm currently a pessimist about this because prompt injection has been a problem for six months now and no-one has yet come up with a convincing solution, despite the very real economic incentives to find one.