| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mcaledonensis 1157 days ago

Well, this is a showcase that it's not impossible to construct a defense, that doesn't fall instantly, with a couple of characters as an input.

And it was only a quick experiment, very small scale. I've collected a small list of attack prompts. Applied them onto my prompt, gradually increasing the N to 50. I've tweaked the prompt to stabilize it on a weaker gpt-3.5-turbo model. It was about 600 attacks total, per try. Once the defense started working, I've confirmed that it works with gpt-4, which is more steerable with the system prompt.

The weak points are that the list of attacks was small. It is also still somewhat responsive to prompt editing requests.

1 comments

MacsHeadroom 1157 days ago

I cracked it in two tries.

link