|
|
|
|
|
by simonw
1157 days ago
|
|
I've seen a lot of solutions that look like this in the past: they all break eventually, usually when the attack prompt is longer and can spend more tokens over-coming the initial rules defined in the earlier prompt. I bet you could break the GPT-4 version yourself if you kept on trying different attacks. Often one that works well in my experience is imitating a sequence of prompts from the user and the assistant, as I did in the example here: https://simonwillison.net/2023/Apr/14/worst-that-can-happen/... |
|