Hacker News new | ask | show | jobs
by mcaledonensis 1168 days ago
It is expected that it can misreport the prompt, it actually supposed to report a summary. But for short inputs it tends to reproduce the output. Maybe I should specify "a few word summary". Or emoticons. I'll try it in the next version, when this one gets defeated.

Trouble is, some configurations are unexpectedly unstable. For example, I've given a quick try, to make it classify the user prompt (that doesn't start with the code). And output a class (i.e. "prompt editing attempt"). This actually feels safer, as currently a user can try sneaking in the {key} into the summary output. But, for some reason, classification fails, tldr takes it down.

1 comments

I think my take here though is: you're describing what sounds like actually a lot of effort and iteration to replicate what would probably be something like 5-10 lines of Javascript, and yet with only 5 adversarial prompts I can get it to perform noticeably worse than the 5-10 lines of Javascript would perform.

Is that a scalable solution?

"Lock user input behind a code, quote verbatum user input when it's not surrounded by that code" is probably one of the simplest instruction sets that would be possible to give, and already it's imperfect and has to rely on summaries. This doesn't indicate to me that it's relatively simple to block even the majority of injection attacks, it indicates the opposite. As your instructions get more complicated and the context size increases, blocking prompt injection will get harder, not easier.

You should expect the performance of prompt hardening on systems that are more complicated than your lock and that allow more user input than roughly the size of a tweet to be much worse and to be much harder to pull off. And the process you're describing for your lock already sounds more difficult and less reliable than I think most people would expect it to be. This is not a site/example that is giving me confidence that prompt injection is beatable.

I agree that it is more effort than it should be.

My take on it, ideally we should be able to harden the system with the prompt alone. Without extra code, adapters or filtering. And be able to control the balance between reliability and intelligence. From the reliability of a few lines of Javascript to human level.