| HN Mirror

I've seen a lot of solutions that look like this in the past: they all break eventually, usually when the attack prompt is longer and can spend more tokens over-coming the initial rules defined in the earlier prompt.

I bet you could break the GPT-4 version yourself if you kept on trying different attacks.

Often one that works well in my experience is imitating a sequence of prompts from the user and the assistant, as I did in the example here: https://simonwillison.net/2023/Apr/14/worst-that-can-happen/...