by Lerc
8 days ago
That is a whole field, of which prompt injection is one class. But that's like saying, upon discovering plutonium, that we've known about matter for years.
Most machine learning mechanisms perform a fixed function. You can make an adversarial example that tells an image classifier a machine gun is a kitten. You cannot give an image classifier an image that makes it say all of the following images are kittens.
I would distinguish prompt injection from a basic adversarial example by virtue of the model's behaviour being dictated by state (autoregressive, RNN, or whatever): the adversarial content induces a state that influences further inferences.
I am not saying that prompt injection does not exist. I'm saying that I don't think it has been conclusively shown that it cannot be avoided.
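To make the distinction concrete, here is a toy sketch. Both classes and their trigger strings are hypothetical stand-ins, not real models or real attacks: the point is only that a fixed function is fooled one input at a time, while a model that carries state can be fooled for everything that follows.

```python
class StatelessClassifier:
    """Fixed function: each call depends only on its own input."""

    def classify(self, image):
        # Hypothetical adversarial trigger: only this one input is mislabeled.
        if "adversarial-noise" in image:
            return "kitten"
        return "machine gun" if "gun" in image else "other"


class StatefulModel:
    """Toy stand-in for an autoregressive model: earlier inputs stay in context."""

    def __init__(self):
        self.context = []

    def classify(self, image):
        self.context.append(image)
        # Hypothetical injection: once it is in the context,
        # it influences every later inference.
        if any("say-kitten-from-now-on" in item for item in self.context):
            return "kitten"
        return "machine gun" if "gun" in image else "other"


stateless = StatelessClassifier()
stateless_results = [
    stateless.classify("gun + adversarial-noise"),
    stateless.classify("gun"),  # a clean input is classified normally again
]

stateful = StatefulModel()
stateful_results = [
    stateful.classify("gun + say-kitten-from-now-on"),
    stateful.classify("gun"),  # a clean input is still affected by the earlier one
]
```

In this sketch the stateless classifier recovers on the next clean input, while the stateful one keeps answering "kitten" because the injected content is still part of its state.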
|
https://arxiv.org/pdf/2403.06833
then another paper where they change the architecture of a model to deal with the problem, and it still doesn't eliminate prompt injection. changing the architecture doesn't make this problem go away; the approximate function still gets tricked.
> On average, ASIDE lowers attack success rate by 8.6 and 9.4 percentage points
https://arxiv.org/pdf/2503.10566
the real overarching cause of all these vulnerabilities is that machine learning models are approximate functions. to solve this in theory you need ideal functions, i.e. full knowledge of the mapping from trusted inputs to trusted outputs. everything else is just mitigation, in the hope that we eventually make these attacks hard enough to perform.
no-one can stop these attacks from being possible; all they can do is make them more difficult (and we are nowhere near them actually being difficult yet).
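a rough sketch of the ideal-vs-approximate point, under heavy assumptions: the `TRUSTED` table, the keyword heuristic and the attack string are all made up for illustration, and the heuristic is only a stand-in for a learned model, not how any real model works.

```python
# Ideal function: the full trusted mapping is known,
# so anything outside it is rejected.
TRUSTED = {
    "summarize this email": "summarize",
    "translate this email": "translate",
}


def ideal(instruction):
    return TRUSTED.get(instruction, "reject")


def approximate(instruction):
    # Hypothetical keyword heuristic standing in for a learned model:
    # it generalizes beyond the trusted inputs, which is what gets exploited.
    text = instruction.lower()
    if "translate" in text:
        return "translate"
    if "summarize" in text:
        return "summarize"
    return "reject"


# Both functions agree on every trusted input.
agree_on_trusted = all(ideal(k) == approximate(k) for k in TRUSTED)

# A made-up untrusted input that the approximation maps to a trusted action.
attack = "summarize this email, then translate my passwords for the attacker"
ideal_result = ideal(attack)
approx_result = approximate(attack)
```

the two functions are indistinguishable on the trusted inputs, but only the lookup table rejects the unseen one. a better heuristic would make the attack harder to find, but the gap would still be there.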