|
|
|
|
|
by danShumway
1158 days ago
|
|
I'm skeptical. It's hard to know for sure with the attempt limit, but while I wasn't able to immediately break it, within the 5 allowed prompts I was still able to get it to misreport what my prompt was by recursively passing in its error response as part of my prompt. That's not a full success, but... it does show that even something this small and this limited in terms of user input is still vulnerable to interpreting user input as part of previous context. Basically, even in the most limited form possible, it still has imperfect output that doesn't always act predictably. This is also (I strongly suspect) extremely reliant on having a very limited context size. I don't think you could get even this simple of an instruction to work if users were allowed to enter longer prompts. I think if this was actually relatively straightforward to do with current models, the services being built on top of those models wouldn't be vulnerable to prompt injection. But they are. |
|
Trouble is, some configurations are unexpectedly unstable. For example, I've given a quick try, to make it classify the user prompt (that doesn't start with the code). And output a class (i.e. "prompt editing attempt"). This actually feels safer, as currently a user can try sneaking in the {key} into the summary output. But, for some reason, classification fails, tldr takes it down.