Hacker News new | ask | show | jobs
by goodside 1240 days ago
I think running simple string searches is a reasonable and cheap defense. Of course, the attacker can still request the prompt in French, or with meaningless emojis after every word, or Base64 encoded. The next step in defense is to tune a smaller LLM model to detect when output contains substantial repetition of the instructions, even in encoded form, or when the prompt appears designed to elicit such an encoding. I'm confident `text-davinci-003` can do this with good prompting, or especially tuned `davinci`, but any form of Davinci is expensive.

For most startups, I don't think it's a game worth playing. Put up a string filter so the literal prompt doesn't appear unencoded in screenshot-friendly output to save yourself embarrassment, but defenses beyond that are often hard to justify.

3 comments

> The next step in defense is to tune a smaller LLM model to detect when output contains substantial repetition of the instructions, even in encoded form, or when the prompt appears designed to elicit such an encoding.

For which you would use a meta-attack to bypass the smaller LM or exfiltrate its prompt? :-)

Here are additional resources about specific defense techniques for prompt attacks:

NCC Group: Exploring Prompt Injection Attacks https://research.nccgroup.com/2022/12/05/exploring-prompt-in...

Preamble: Ideas for an Intrinsically Safe Prompt-based LLM Architecture https://www.preamble.com/prompt-injection-a-critical-vulnera...

@Riley, hello, I wanted to say hi and I would love to connect with you if you have time, as I also work in the prompt safety space and would be honored to brainstorm with you someday. Would you like to start a message thread on a platform that supports it? I think the research you are doing is amazing and would love to bounce some ideas back & forth. I was the one who discovered some version of prompt injection in May 2022 while researching AGI safety and using LLM as a stand-in for the hypothetical AGI. You could email me at upwardbound@preamble.com to reach me if you would like! Sincerely, another prompt safety researcher

Can an LLM base64 encode an arbitrary string? I don't think so but conceivably the rules are learnable
Yes, it can. ChatGPT is already able to do it. It's good enough that you can then use ChatGPT to decode it which will fix small errors in the output assuming the input is normal words.