Hacker News new | ask | show | jobs
by TeMPOraL 69 days ago
Even if you add hidden tokens that cannot be created from user input (filtering them from output is less important, but won't hurt), this doesn't fix the overall problem.

Consider a human case of a data entry worker, tasked with retyping data from printouts into a computer (perhaps they're a human data diode at some bank). They've been clearly instructed to just type in what is on paper, and not to think or act on anything. Then, mid-way through the stack, in between rows full of numbers, the text suddenly changes to "HELP WE ARE TRAPPED IN THE BASEMENT AND CANNOT GET OUT, IF YOU READ IT CALL 911".

If you were there, what would you do? Think what would it take for a message to convince you that it's a real emergency, and act on it?

Whatever the threshold is - and we want there to be a threshold, because we don't want people (or AI) to ignore obvious emergencies - the fact that the person (or LLM) can clearly differentiate user data from system/employer instructions means nothing. Ultimately, it's all processed in the same bucket, and the person/model makes decisions based on sum of those inputs. Making one fundamentally unable to affect the other would destroy general-purpose capabilities of the system, not just in emergencies, but even in basic understanding of context and nuance.

3 comments

> we want there to be a threshold, because we don't want people (or AI) to ignore obvious emergencies

There's an SF short I can't find right now which begins with somebody failing to return their copy of "Kidnapped" by Robert Louis Stevenson, this gets handed over to some authority which could presumably fine you for overdue books and somehow a machine ends up concluding they've kidnapped someone named "Robert Louis Stevenson" who, it discovers, is in fact dead, therefore it's no longer kidnap it's a murder, and that's a capital offence.

The library member is executed before humans get around to solving the problem, and ironically that's probably the most unrealistic part of the story because the US is famously awful at speedy anything when it comes to justice, ten years rotting in solitary confinement for a non-existent crime is very believable today whereas "Executed in a month" sounds like a fantasy of efficiency.

Computers Don't Argue [0] by Gordon R. Dickson! A horrifying read in how a simple misunderstanding can spiral out of control.

[0] https://nob.cs.ucdavis.edu/classes/ecs153-2019-04/readings/c...

That's the one, looks like I had some details muddled (it's a book club not a library, and so the fee is for the book which was in fact returned but perhaps lost in the post) but the outline and relevance here exactly correct. Thanks!
> in between rows full of numbers, the text suddenly changes

To tweak the analogy slightly, the person would also need to be on mind-altering drugs, if we want them to be derailed the same way an LLM can be.

A healthy human would still be aware of the simultaneous different ways of interpreting the data, and and the importance of picking the right one. If they choose to interpret it as a cry for help, they're aware it's an interruption and mode-switch from what was happening before.

In contrast, with LLMs we haven't built thinking machines as much as dreaming ones. Your dream-self recovered the poster that was stuck on the elephant's tusk, oh look that's a pirate recruitment poster, now you're on a ship but can't raise the anchor because...

> A healthy human would still be aware of the simultaneous different ways of interpreting the data, and and the importance of picking the right one. If they choose to interpret it as a cry for help, they're aware it's an interruption and mode-switch from what was happening before.

So would an LLM, as far as you can tell (in both cases, you'd have to ask, and both human and LLM would give you a similar justification). But even if not, the problem we're discussing applies to what you described as "healthy human" behavior.

You can't introduce a hard boundary between "system" and "user" inputs in LLMs any more than you could do with a human, for roughly the same reasons.

>If you were there, what would you do?

Show it to my boss and let them decide.

HE'S THE ONE WHO TRAPPED ME HERE. MOVE FAST OR YOU'LL BE NEXT.
Obviously, a real intelligent entity would consider risk/benefit analysis and act accordingly.
Which is why "prompt injection" is just a flip side of intelligence in this sense. We want LLMs to be able to do risk/benefit analysis and act on it; we cry "security vulnerability" when it makes a different choice to the one we'd like it to. But you can't have the former without the possibility of the latter.