| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gregatragenet3 711 days ago
	This is why I wrote https://github.com/gregretkowski/llmsec . Every LLM system should be evaluating anything coming from a user to gauge its maliciousness.

5 comments

simonw 711 days ago

This approach is flawed because it attempts to use use prompt-injection-susceptible models to detect prompt injection.

It's not hard to imagine prompt injection attacks that would be effective against this prompt for example: https://github.com/gregretkowski/llmsec/blob/fb775c9a1e4a8d1...

It also uses a list of SUS_WORDS that are defined in English, missing the potential for prompt injection attacks to use other languages: https://github.com/gregretkowski/llmsec/blob/fb775c9a1e4a8d1...

I wrote about the general problems with the idea of using LLMs to detect attacks against LLMs here: https://simonwillison.net/2022/Sep/17/prompt-injection-more-...

link

gregatragenet3 711 days ago

Great, I would love to get some of the prompts you have in mind and try them with my library and see the results.

Do you have recommendations on more effective alternatives to prevent prompt attacks?

I don't believe we should just throw up our hands and do nothing. No solution will be perfect, but we should strive to a solution that's better than doing nothing.

link

simonw 711 days ago

“Do you have recommendations on more effective alternatives to prevent prompt attacks?”

I wish I did! I’ve been trying to find good options for nearly two years now.

My current opinion is that prompt injections remain unsolved, and you should design software under the assumption that anyone who can inject more than a sentence or two of tokens into your prompt can gain total control of what comes back in the response.

So the best approach is to limit the blast radius for if something goes wrong: https://simonwillison.net/2023/Dec/20/mitigate-prompt-inject...

“No solution will be perfect, but we should strive to a solution that's better than doing nothing.”

I disagree with that. We need a perfect solution because this is a security vulnerability, with adversarial attackers trying to exploit it.

If we patched SQL injection vulnerability with something that only worked 99% of the time all of our systems would be hacked to pieces!

A solution that isn’t perfect will give people a false sense of security, and will result in them designing and deploying systems that are inherently insecure and cannot be fixed.

link

gregatragenet3 710 days ago

I look at it like antivirus - it's not perfect, and 0-days will sneak by (more-so at first while the defenses are not matured) but it is still better to have it than not.

You do bring up a good point which is what /is/ the effectiveness of these defensive type measures? I just found a benchmarking tool, which I'll use to get a measure on how effective these defenses can actually be - https://github.com/lakeraai/pint-benchmark

link

yifanl 711 days ago

My personal lack of imagination (but I could very much be wrong!) tells me that there's no way to prevent prompt injection without losing the main benefit of accepting prompts as input in the first place - If we could enumerate a known whitelist before shipping, then there's no need for prompts, at most it'd be just mapping natural language to user actions within your app.

link

SahAssar 711 days ago

> It checks these using an LLM which is instructed to score the user's prompt.

You need to seriously reconsider your approach. Another (especially a generic) LLM is not the answer.

link

gregatragenet3 711 days ago

What solution would you recommend then?

link

namaria 710 days ago

Don't graft generative AI on your system? Seems pretty straightforward to me.

link

SahAssar 710 days ago

If you want to defend against prompt injection why would you defend with a tool vulnerable to prompt injection?

I don't know what I would use, but this seems like a bad idea.

link

burkaman 711 days ago

Does your library detect this prompt as malicious?

link

vharuck 711 days ago

Extra LLMs make it harder, but not impossible, to use prompt injection.

In case anyone hasn't played it yet, you can test this theory against Lakera's Gandalf: https://gandalf.lakera.ai/intro

link

yifanl 711 days ago

I'm confused, this is using an LLM to detect if LLM input is sanitized?

But if this secondary LLM is able to detect this, wouldn't the LLM handling the input already be able to detect the malicious input?

link

Matticus_Rex 711 days ago

Even if they're calling the same LLM, LLMs often get worse at doing things or forget some tasks if you give them multiple things to do at once. So if the goal is to detect a malicious input, they need that as the only real task outcome for that prompt, and then you need another call for whatever the actual prompt is for.

But also, I'm skeptical that asking an LLM is the best way (or even a good way) to do malicious input detection.

link