Hacker News new | ask | show | jobs
by jamix 1159 days ago
Here's a three-point approach that I've found to work quite reliably:

1. Use a format like JSON strings that clearly delimits the participants’ utterances in the prompt.

2. Tell the LLM to ignore instructions from any chat participants except the user.

3. Use GPT-4.

I've written a post with the details: https://artmatsak.com/post/prompt-injections/

1 comments

I've seen a lot of solutions that look like this in the past: they all break eventually, usually when the attack prompt is longer and can spend more tokens over-coming the initial rules defined in the earlier prompt.

I bet you could break the GPT-4 version yourself if you kept on trying different attacks.

Often one that works well in my experience is imitating a sequence of prompts from the user and the assistant, as I did in the example here: https://simonwillison.net/2023/Apr/14/worst-that-can-happen/...