by sillysaurusx
348 days ago
Alternatively, train a model to detect prompt injections (a simple classifier would work) and reject user inputs that trigger the detector above a certain threshold. This has the same downsides as email spam detection: false positives. But, like spam detection, it might work well enough. It’s so simple that I wonder if I’m missing some reason it won’t work. Hasn’t anyone tried this?
"it might work well enough" isn't good enough here.
If a spam detector occasionally fails to identify spam, you get a spam email in your inbox.
If a prompt injection detector fails just once to prevent a prompt injection attack that causes your LLM system to leak your private data to an attacker, your private data is stolen for good.
In web application security, 99% is a failing grade: https://simonwillison.net/2023/May/2/prompt-injection-explai...
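
To put a rough number on why 99% fails, here is a back-of-the-envelope sketch in Python. The function name `bypass_probability` and the 0.99 detection rate are illustrative assumptions, not measurements of any real detector, and the sketch treats each attempt as independent. A real attacker adapts after each rejection, so this is if anything optimistic for the defender.

```python
def bypass_probability(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` injections slips past the detector.

    Assumes every attempt is caught independently with probability
    `detection_rate` (a simplifying assumption for illustration).
    """
    return 1.0 - detection_rate ** attempts


if __name__ == "__main__":
    # Hypothetical detector that catches 99% of injection attempts.
    for n in (1, 10, 100, 500):
        p = bypass_probability(0.99, n)
        print(f"{n:>4} attempts: {p:.1%} chance of at least one bypass")
```

Under these assumptions, an attacker who can retry 100 times gets through with a probability of roughly 63%, and it only takes one success to leak the data. A spam filter with the same miss rate just costs you a few unwanted emails.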