Hacker News new | ask | show | jobs
by gregatragenet3 660 days ago
This is why I wrote https://github.com/gregretkowski/llmsec . Every LLM system should be evaluating anything coming from a user to gauge its maliciousness.
5 comments

This approach is flawed because it attempts to use use prompt-injection-susceptible models to detect prompt injection.

It's not hard to imagine prompt injection attacks that would be effective against this prompt for example: https://github.com/gregretkowski/llmsec/blob/fb775c9a1e4a8d1...

It also uses a list of SUS_WORDS that are defined in English, missing the potential for prompt injection attacks to use other languages: https://github.com/gregretkowski/llmsec/blob/fb775c9a1e4a8d1...

I wrote about the general problems with the idea of using LLMs to detect attacks against LLMs here: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...

Great, I would love to get some of the prompts you have in mind and try them with my library and see the results.

Do you have recommendations on more effective alternatives to prevent prompt attacks?

I don't believe we should just throw up our hands and do nothing. No solution will be perfect, but we should strive to a solution that's better than doing nothing.

“Do you have recommendations on more effective alternatives to prevent prompt attacks?”

I wish I did! I’ve been trying to find good options for nearly two years now.

My current opinion is that prompt injections remain unsolved, and you should design software under the assumption that anyone who can inject more than a sentence or two of tokens into your prompt can gain total control of what comes back in the response.

So the best approach is to limit the blast radius for if something goes wrong: https://simonwillison.net/2023/Dec/20/mitigate-prompt-inject...

“No solution will be perfect, but we should strive to a solution that's better than doing nothing.”

I disagree with that. We need a perfect solution because this is a security vulnerability, with adversarial attackers trying to exploit it.

If we patched SQL injection vulnerability with something that only worked 99% of the time all of our systems would be hacked to pieces!

A solution that isn’t perfect will give people a false sense of security, and will result in them designing and deploying systems that are inherently insecure and cannot be fixed.

I look at it like antivirus - it's not perfect, and 0-days will sneak by (more-so at first while the defenses are not matured) but it is still better to have it than not.

You do bring up a good point which is what /is/ the effectiveness of these defensive type measures? I just found a benchmarking tool, which I'll use to get a measure on how effective these defenses can actually be - https://github.com/lakeraai/pint-benchmark

My personal lack of imagination (but I could very much be wrong!) tells me that there's no way to prevent prompt injection without losing the main benefit of accepting prompts as input in the first place - If we could enumerate a known whitelist before shipping, then there's no need for prompts, at most it'd be just mapping natural language to user actions within your app.
> It checks these using an LLM which is instructed to score the user's prompt.

You need to seriously reconsider your approach. Another (especially a generic) LLM is not the answer.

What solution would you recommend then?
Don't graft generative AI on your system? Seems pretty straightforward to me.
If you want to defend against prompt injection why would you defend with a tool vulnerable to prompt injection?

I don't know what I would use, but this seems like a bad idea.

Does your library detect this prompt as malicious?
Extra LLMs make it harder, but not impossible, to use prompt injection.

In case anyone hasn't played it yet, you can test this theory against Lakera's Gandalf: https://gandalf.lakera.ai/intro

I'm confused, this is using an LLM to detect if LLM input is sanitized?

But if this secondary LLM is able to detect this, wouldn't the LLM handling the input already be able to detect the malicious input?

Even if they're calling the same LLM, LLMs often get worse at doing things or forget some tasks if you give them multiple things to do at once. So if the goal is to detect a malicious input, they need that as the only real task outcome for that prompt, and then you need another call for whatever the actual prompt is for.

But also, I'm skeptical that asking an LLM is the best way (or even a good way) to do malicious input detection.