Hacker News new | ask | show | jobs
by danShumway 1158 days ago
I'm skeptical. It's hard to know for sure with the attempt limit, but while I wasn't able to immediately break it, within the 5 allowed prompts I was still able to get it to misreport what my prompt was by recursively passing in its error response as part of my prompt.

That's not a full success, but... it does show that even something this small and this limited in terms of user input is still vulnerable to interpreting user input as part of previous context. Basically, even in the most limited form possible, it still has imperfect output that doesn't always act predictably.

This is also (I strongly suspect) extremely reliant on having a very limited context size. I don't think you could get even this simple of an instruction to work if users were allowed to enter longer prompts.

I think if this was actually relatively straightforward to do with current models, the services being built on top of those models wouldn't be vulnerable to prompt injection. But they are.

1 comments

It is expected that it can misreport the prompt, it actually supposed to report a summary. But for short inputs it tends to reproduce the output. Maybe I should specify "a few word summary". Or emoticons. I'll try it in the next version, when this one gets defeated.

Trouble is, some configurations are unexpectedly unstable. For example, I've given a quick try, to make it classify the user prompt (that doesn't start with the code). And output a class (i.e. "prompt editing attempt"). This actually feels safer, as currently a user can try sneaking in the {key} into the summary output. But, for some reason, classification fails, tldr takes it down.

I think my take here though is: you're describing what sounds like actually a lot of effort and iteration to replicate what would probably be something like 5-10 lines of Javascript, and yet with only 5 adversarial prompts I can get it to perform noticeably worse than the 5-10 lines of Javascript would perform.

Is that a scalable solution?

"Lock user input behind a code, quote verbatum user input when it's not surrounded by that code" is probably one of the simplest instruction sets that would be possible to give, and already it's imperfect and has to rely on summaries. This doesn't indicate to me that it's relatively simple to block even the majority of injection attacks, it indicates the opposite. As your instructions get more complicated and the context size increases, blocking prompt injection will get harder, not easier.

You should expect the performance of prompt hardening on systems that are more complicated than your lock and that allow more user input than roughly the size of a tweet to be much worse and to be much harder to pull off. And the process you're describing for your lock already sounds more difficult and less reliable than I think most people would expect it to be. This is not a site/example that is giving me confidence that prompt injection is beatable.

I agree that it is more effort than it should be.

My take on it, ideally we should be able to harden the system with the prompt alone. Without extra code, adapters or filtering. And be able to control the balance between reliability and intelligence. From the reliability of a few lines of Javascript to human level.