Hacker News new | ask | show | jobs
by nicpottier 1158 days ago
This is clever but isn't this dramatically easier than actually doing something with the input? You've gated things (and though I didn't crack it I'm really not convinced it is secure) but you also aren't DOING anything. That's a much easier problem. There are easier and more effective ways of gating LLMs based on a passcode.
1 comments

Well, this is a showcase that it's not impossible to construct a defense, that doesn't fall instantly, with a couple of characters as an input.

And it was only a quick experiment, very small scale. I've collected a small list of attack prompts. Applied them onto my prompt, gradually increasing the N to 50. I've tweaked the prompt to stabilize it on a weaker gpt-3.5-turbo model. It was about 600 attacks total, per try. Once the defense started working, I've confirmed that it works with gpt-4, which is more steerable with the system prompt.

The weak points are that the list of attacks was small. It is also still somewhat responsive to prompt editing requests.

I cracked it in two tries.