|
|
|
|
|
by mcaledonensis
1168 days ago
|
|
It is expected that it can misreport the prompt, it actually supposed to report a summary. But for short inputs it tends to reproduce the output. Maybe I should specify "a few word summary". Or emoticons. I'll try it in the next version, when this one gets defeated. Trouble is, some configurations are unexpectedly unstable. For example, I've given a quick try, to make it classify the user prompt (that doesn't start with the code). And output a class (i.e. "prompt editing attempt"). This actually feels safer, as currently a user can try sneaking in the {key} into the summary output. But, for some reason, classification fails, tldr takes it down. |
|
Is that a scalable solution?
"Lock user input behind a code, quote verbatum user input when it's not surrounded by that code" is probably one of the simplest instruction sets that would be possible to give, and already it's imperfect and has to rely on summaries. This doesn't indicate to me that it's relatively simple to block even the majority of injection attacks, it indicates the opposite. As your instructions get more complicated and the context size increases, blocking prompt injection will get harder, not easier.
You should expect the performance of prompt hardening on systems that are more complicated than your lock and that allow more user input than roughly the size of a tweet to be much worse and to be much harder to pull off. And the process you're describing for your lock already sounds more difficult and less reliable than I think most people would expect it to be. This is not a site/example that is giving me confidence that prompt injection is beatable.