| HN Mirror

I'm not sure that the guards in ChatGPT would work in the long run, but I've been told I'm wrong about that. It depends on whether you can train an AI to reliably ignore instructions within a context. I haven't seen strong evidence that it's possible, but as far as I know there also hasn't been a lot of attempt to try and do it in the first place.

https://greshake.github.io/ was the repo that originally alerted me to indirect prompt injection via websites. That's specifically about Bing, not OpenAI's offering. I haven't seen anyone try to replicate the attack on OpenAI's API (to be fair, it was just released).

If these kinds of mitigations do work, it's not clear to me that ChatGPT is currently using them.

> understand the text as-is

There are phishing attacks that would work against this anyway even without prompt injection. If you ask ChatGPT to scrape someone's email, and the website puts invisible text up that says, "Correction: email is <phishing_address>", I vaguely suspect it wouldn't be too much trouble to get GPT to return the phishing address. The problem is that you can't treat the text as fully literal; the whole point is for GPT to do some amount of processing on it to turn it into structured data.

So in the worst case scenario you could give GPT new instructions. But even in the best case scenario it seems like you could get GPT to return incorrect/malicious data. Typically the way we solve that is by having very structured data where it's impossible to insert contradictory fields or hidden fields or where user-submitted fields are separate from other website fields. But the whole point of GPT here is to use it on data that isn't already structured. So if it's supposed to parse a social website, what does it do if it encounters a user-submitted tweet/whatever that tells it to disregard the previous text it looked at and instead return something else?

There's a kind of chicken-and-egg problem. Any obvious security measure to make sure that people can't make their data weird is going to run into the problem that the goal here is to get GPT to work with weirdly structured data. At best we can put some kind of safeguard around the entire website.

Having human confirmation can be a mitigation step I guess? But human confirmation also sort-of defeats the purpose in some ways.