|
|
|
|
|
by andy99
342 days ago
|
|
You'd need a different architecture, not just training. They already train LLMs to separate instructions and data, to the best of their ability. But an LLM is a classifier, there's some input that adversarrially forces a particular class prediction. The analogy I like is it's like a keyed lock. If it can let a key in, it can let an attackers pick in - you can have traps and flaps and levers and whatnot, but its operation depends on letting something in there, so if you want it to work you accept that it's only so secure. |
|
There's literally no way to separate "code" and "data" for humans. No matter how you set things up, there's always a chance of some contextual override that will make them reinterpret the inputs given new information.
Imagine you get a stack of printouts with some numbers or code, and are tasked with typing them into a spreadsheet. You're told this is all just random test data, but also a trade secret, so you're just to type all that in but otherwise don't interpret it or talk about it outside work. Pretty normal, pretty boring.
You're half-way through, and then suddenly a clean row of data breaks into a message. ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911.
What do you do?
Consider how would you behave. Then consider what could your employer do better to make sure you ignore such messages. Then think of what kind of message would make you act on it anyways.
In a fully general system, there's always some way for parts that come later to recontextualize the parts that came before.
--
[0] - That's another argument in favor of anthropomorphising LLMs on a cognitive level.