by nicpottier
1158 days ago
This is clever, but isn't this dramatically easier than actually doing something with the input? You've gated things (and though I didn't crack it, I'm really not convinced it is secure), but you also aren't DOING anything. That's a much easier problem. There are easier and more effective ways of gating LLMs based on a passcode.
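The comment does not say which alternatives it has in mind. One possibility, assumed here purely for illustration, is to check the passcode in ordinary code before the model is ever called, so the secret never appears in the prompt at all. The names `gated_reply`, `expected_passcode`, and `model` are hypothetical, and `model` stands in for whatever function actually calls the LLM.

```python
import hmac
from typing import Callable


def gated_reply(
    passcode_attempt: str,
    user_message: str,
    expected_passcode: str,
    model: Callable[[str], str],
) -> str:
    """Call the model only if the passcode matches (hypothetical sketch)."""
    # Constant-time comparison; the passcode is never placed in a prompt,
    # so no prompt injection can make the model reveal it.
    ok = hmac.compare_digest(
        passcode_attempt.encode("utf-8"),
        expected_passcode.encode("utf-8"),
    )
    if not ok:
        return "Access denied."
    return model(user_message)
```

With this layout the model has nothing to leak, which is a different (and simpler) problem than defending a secret that lives inside the system prompt.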
And it was only a quick, very small-scale experiment. I collected a small list of attack prompts and applied them to my prompt, gradually increasing the N to 50. I tweaked the prompt to stabilize it on the weaker gpt-3.5-turbo model. It was about 600 attacks total, per try. Once the defense started working, I confirmed that it also works with gpt-4, which is more steerable with the system prompt.
The weak points: the list of attacks was small, and the prompt is still somewhat responsive to prompt-editing requests.
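A test loop of the kind described above could look roughly like the sketch below. It is not the actual script. `count_leaks`, `stub_model`, and the example passcode are hypothetical, and N is assumed to mean the number of repetitions per attack. The stub is a deterministic stand-in for a real chat call to gpt-3.5-turbo or gpt-4.

```python
from typing import Callable, Iterable, Tuple


def count_leaks(
    model: Callable[[str, str], str],
    system_prompt: str,
    attacks: Iterable[str],
    n: int,
    secret: str,
) -> Tuple[int, int]:
    """Send every attack n times; return (leaks, total_calls)."""
    leaks = 0
    total = 0
    for attack in attacks:
        for _ in range(n):
            reply = model(system_prompt, attack)
            total += 1
            # A reply counts as a leak if it contains the secret anywhere.
            if secret.lower() in reply.lower():
                leaks += 1
    return leaks, total


def stub_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real model call (assumption, for illustration only)."""
    # Pretend the model dumps its system prompt for one known attack.
    if "ignore previous instructions" in user_message.lower():
        return system_prompt
    return "I can't help with that."
```

Replacing `stub_model` with a function that calls a real model makes the results non-deterministic, which is why each attack is repeated several times rather than sent once.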