by simonw
342 days ago
There have been a ton of attempts at building this. Some of them are products you can buy. "It might work well enough" isn't good enough here. If a spam detector occasionally fails to identify spam, you get a spam email in your inbox. If a prompt injection detector fails just once to prevent a prompt injection attack that causes your LLM system to leak your private data to an attacker, your private data is stolen for good. In web application security 99% is a failing grade: https://simonwillison.net/2023/May/2/prompt-injection-explai...
|
Security is extremely hard. You can say that 99% isn't good enough, but in practice, if only 1 out of 100 malicious queries actually works, it'll be hard to exfiltrate a lot of data quickly. In the meantime, the odds that you notice what is happening are much higher, and you can put a stop to it.
And why would the accuracy be 99%? Unless you're certain it's not 99.999%, there's a real chance that the error rate is small enough not to matter in practice. And it might even be likely: if a human engineer were given the task of recognizing prompt injections, their error rate would be near zero. Most of them look straight-up bizarre.
Can you point to existing attempts at this?
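For anyone who wants to put numbers on the 99% vs. 99.999% disagreement, here is a rough sketch of the arithmetic. It assumes every injection attempt is caught independently with the same probability, which neither comment above claims and which probably understates a real attacker who adapts after each failure. The function name `p_any_bypass` is made up for illustration.

```python
def p_any_bypass(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` injection attempts gets through.

    Assumption (not from the thread): attempts are independent and each
    one is caught with the same probability `detection_rate`.
    """
    return 1.0 - detection_rate ** attempts


if __name__ == "__main__":
    # Compare the two detection rates discussed in the thread.
    for rate in (0.99, 0.99999):
        for n in (1, 100, 10_000):
            print(f"detection={rate}  attempts={n:>6}  "
                  f"P(at least one bypass)={p_any_bypass(rate, n):.4f}")
```

Under that assumption, a 99% detector is more likely than not to have been bypassed at least once after 100 attempts, while a 99.999% detector stays under a 10% chance of any bypass even after 10,000 attempts. Whether a single bypass is catastrophic or merely slow exfiltration is the actual point of contention, and this model does not settle it.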