| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dijksterhuis 6 days ago

if you want to avoid my massive post (sorry), there's a paper here positing how instruction-data separation is likely a major cause of prompt injection specifically.

https://arxiv.org/pdf/2403.06833

then another paper where they change the architecture of a model to deal with the problem and it doesn't eliminate prompt injection. changing the architecture doesn't make this problem go away. the approximate function still gets tricked.

> On average, ASIDE lowers attack success rate by 8.6 and 9.4 percentage points

https://arxiv.org/pdf/2503.10566

the real over-arching cause of all these vulnerabilities is that machine learning models are approximate functions. you need ideal functions to theoretically solve this, i.e. full knowledge of the mapping between trusted inputs to trusted outputs. everything else is just mitigating it in the hope we eventually make it hard enough to perform these attacks.

no-one can stop these attacks from being possible, all they can do is make them more difficult to do (and we are nowhere near them actually being difficult yet).