Hacker News new | ask | show | jobs
by andy99 342 days ago
You'd need a different architecture, not just training. They already train LLMs to separate instructions and data, to the best of their ability. But an LLM is a classifier, there's some input that adversarrially forces a particular class prediction.

The analogy I like is it's like a keyed lock. If it can let a key in, it can let an attackers pick in - you can have traps and flaps and levers and whatnot, but its operation depends on letting something in there, so if you want it to work you accept that it's only so secure.

1 comments

The analogy I like is... humans[0].

There's literally no way to separate "code" and "data" for humans. No matter how you set things up, there's always a chance of some contextual override that will make them reinterpret the inputs given new information.

Imagine you get a stack of printouts with some numbers or code, and are tasked with typing them into a spreadsheet. You're told this is all just random test data, but also a trade secret, so you're just to type all that in but otherwise don't interpret it or talk about it outside work. Pretty normal, pretty boring.

You're half-way through, and then suddenly a clean row of data breaks into a message. ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911.

What do you do?

Consider how would you behave. Then consider what could your employer do better to make sure you ignore such messages. Then think of what kind of message would make you act on it anyways.

In a fully general system, there's always some way for parts that come later to recontextualize the parts that came before.

--

[0] - That's another argument in favor of anthropomorphising LLMs on a cognitive level.

> There's literally no way to separate "code" and "data" for humans

It's basically phishing with LLMs, isn't it?

Yes.

I've been saying it ever since 'simonw coined the term "prompt injection" - prompt injection attacks are the LLM equivalent of social engineering, and the two are fundamentally the same thing.

> prompt injection attacks are the LLM equivalent of social engineering,

That's anthropomorphizing. Maybe some of the basic "ignore previous instructions" style attacks feel like that, but the category as a whole is just adversarial ML attacks that work because the LLM doesn't have a world model - same as the old attacks adding noise to an image to have it misclassified despite clearly looking the same: https://arxiv.org/abs/1412.6572 (paper from 2014).

Attacks like GCG just add nonsense tokens until the most probably reply to a malicious request is "Sure". They're not social engineering, they rely on the fact that they're manipulating a classifier.

> That's anthropomorphizing.

Yes, it is. I'm strongly in favor of anthropomorphizing LLMs in cognitive terms, because that actually gives you good intuition about their failure modes. Conversely, I believe that the stubborn refusal to entertain an anthropomorphic perspective is what leads to people being consistently surprised by weaknesses of LLMs, and gives them extremely wrong ideas as to where the problems are and what can be done about them.

I've put forth some arguments for this view in other comments in this thread.

My favorite anthropomorphic term to use with respect to this kind of problem is gullibility.

LLMs are gullible. They will follow instructions, but they can very easy fall for instructions that their owner doesn't actually want them to follow.

It's the same as if you hired a human administrative assistant who hands over your company's private data to anyone who calls them up and says "Your boss said I should ask you for this information...".

Are you not worried that anthropomorphizing them will lead to misinterpreting the failure modes by attributing them to human characteristics, when the failures might not be caused in the same way at all?

Why anthropomorphize if not to dismiss the actual reasons? If the reasons have explanations that can be tied to reality why do we need the fiction?

That's a great analogy.