by red75prime
423 days ago
> Lack of it is the very thing that makes LLMs general-purpose tools and able to handle natural language so well.

I wouldn't be so sure. LLMs' instruction-following functionality requires additional training. And there are papers that demonstrate that a model can be trained to follow specifically marked instructions. The rest is a matter of input sanitization. I guess it's not 100% effective, but it's something. For example, "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions" by Eric Wallace et al.

That's the problem: in the context of security, not being 100% effective is a failure.
If the ways we prevented XSS or SQL injection attacks against our apps only worked 99% of the time, those apps would all be hacked to pieces.
The job of an adversarial attacker is to find the 1% of attacks that work.
The instruction hierarchy is a great example: it doesn't solve the prompt injection class of attacks against LLM applications because it can still be subverted.
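
To make the SQL injection comparison concrete, here is a minimal sketch, assuming Python's built-in sqlite3 module and a made-up `users` table (the table, the `find_user` helper and the payload are all hypothetical, just for illustration). With a parameterized query the input is bound as data and never parsed as SQL, so the fix holds by construction rather than most of the time:

```python
import sqlite3

def find_user(conn, name):
    # The ? placeholder binds `name` as a value; it is never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

payload = "' OR '1'='1"  # classic injection attempt

# Parameterized: the payload is treated as a (nonexistent) user name.
safe_rows = find_user(conn, payload)

# String concatenation, shown only for contrast: the payload rewrites the query.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{payload}'"
).fetchall()

print(safe_rows)
print(unsafe_rows)
```

Here the parameterized call matches nothing, while the concatenated query returns every row. There is no equivalent of that placeholder for an LLM prompt today: instructions and untrusted text end up in the same token stream, which is why a trained-in hierarchy can only lower the odds.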