| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by TeMPOraL 342 days ago

The analogy I like is... humans[0].

There's literally no way to separate "code" and "data" for humans. No matter how you set things up, there's always a chance of some contextual override that will make them reinterpret the inputs given new information.

Imagine you get a stack of printouts with some numbers or code, and are tasked with typing them into a spreadsheet. You're told this is all just random test data, but also a trade secret, so you're just to type all that in but otherwise don't interpret it or talk about it outside work. Pretty normal, pretty boring.

You're half-way through, and then suddenly a clean row of data breaks into a message. ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911.

What do you do?

Consider how would you behave. Then consider what could your employer do better to make sure you ignore such messages. Then think of what kind of message would make you act on it anyways.

In a fully general system, there's always some way for parts that come later to recontextualize the parts that came before.

[0] - That's another argument in favor of anthropomorphising LLMs on a cognitive level.

2 comments

anonymars 342 days ago

> There's literally no way to separate "code" and "data" for humans

It's basically phishing with LLMs, isn't it?

link

TeMPOraL 342 days ago

Yes.

I've been saying it ever since 'simonw coined the term "prompt injection" - prompt injection attacks are the LLM equivalent of social engineering, and the two are fundamentally the same thing.

link

andy99 342 days ago

> prompt injection attacks are the LLM equivalent of social engineering,

That's anthropomorphizing. Maybe some of the basic "ignore previous instructions" style attacks feel like that, but the category as a whole is just adversarial ML attacks that work because the LLM doesn't have a world model - same as the old attacks adding noise to an image to have it misclassified despite clearly looking the same: https://arxiv.org/abs/1412.6572 (paper from 2014).

Attacks like GCG just add nonsense tokens until the most probably reply to a malicious request is "Sure". They're not social engineering, they rely on the fact that they're manipulating a classifier.

link

TeMPOraL 342 days ago

> That's anthropomorphizing.

Yes, it is. I'm strongly in favor of anthropomorphizing LLMs in cognitive terms, because that actually gives you good intuition about their failure modes. Conversely, I believe that the stubborn refusal to entertain an anthropomorphic perspective is what leads to people being consistently surprised by weaknesses of LLMs, and gives them extremely wrong ideas as to where the problems are and what can be done about them.

I've put forth some arguments for this view in other comments in this thread.

link

simonw 342 days ago

My favorite anthropomorphic term to use with respect to this kind of problem is gullibility.

LLMs are gullible. They will follow instructions, but they can very easy fall for instructions that their owner doesn't actually want them to follow.

It's the same as if you hired a human administrative assistant who hands over your company's private data to anyone who calls them up and says "Your boss said I should ask you for this information...".

link

Xelynega 342 days ago

Going a step further, I live in a reality where you can train most people against phishing attacks like that.

How accurate is the comparison if LLMs can't recover from phishing attacks like that and become more resilient?

link

Xelynega 342 days ago

Are you not worried that anthropomorphizing them will lead to misinterpreting the failure modes by attributing them to human characteristics, when the failures might not be caused in the same way at all?

Why anthropomorphize if not to dismiss the actual reasons? If the reasons have explanations that can be tied to reality why do we need the fiction?

link

anonymars 342 days ago

> Are you not worried that anthropomorphizing them will lead to misinterpreting the failure modes by attributing them to human characteristics, when the failures might not be caused in the same way at all?

On the other hand, maybe techniques we use to protect against phishing can indeed be helpful against prompt injection. Things like tagging untrusted sources and adding instructions accordingly (along the lines of, "this email is from an untrusted source, be careful"), limiting privileges (perhaps in response to said "instructions"), etc. Why should we treat an LLM differently from an employee in that way?

I remember an HN comment about project management, that software engineering is creating technical systems to solve problems with constraints, while project management is creating people systems to solve problems with constraints. I found it an insightful metaphor and feel like this situation is somewhat similar.

https://news.ycombinator.com/item?id=40002598

link

andy99 342 days ago

Because most people talking about LLMs don't understand how they work so can only function in analogy space. It adds a veneer of intellectualism to what is basically superstition.

link

jacquesm 342 days ago

That's a great analogy.

link