| HN Mirror

I think my take here though is: you're describing what sounds like actually a lot of effort and iteration to replicate what would probably be something like 5-10 lines of Javascript, and yet with only 5 adversarial prompts I can get it to perform noticeably worse than the 5-10 lines of Javascript would perform.

Is that a scalable solution?

"Lock user input behind a code, quote verbatum user input when it's not surrounded by that code" is probably one of the simplest instruction sets that would be possible to give, and already it's imperfect and has to rely on summaries. This doesn't indicate to me that it's relatively simple to block even the majority of injection attacks, it indicates the opposite. As your instructions get more complicated and the context size increases, blocking prompt injection will get harder, not easier.

You should expect the performance of prompt hardening on systems that are more complicated than your lock and that allow more user input than roughly the size of a tweet to be much worse and to be much harder to pull off. And the process you're describing for your lock already sounds more difficult and less reliable than I think most people would expect it to be. This is not a site/example that is giving me confidence that prompt injection is beatable.