| The analogy I like is... humans[0]. There's literally no way to separate "code" and "data" for humans. No matter how you set things up, there's always a chance of some contextual override that will make them reinterpret the inputs given new information. Imagine you get a stack of printouts with some numbers or code, and are tasked with typing them into a spreadsheet. You're told this is all just random test data, but also a trade secret, so you're to just type it all in, without interpreting it or talking about it outside work. Pretty normal, pretty boring. You're halfway through when suddenly a clean row of data breaks into a message: ACCIDENT IN LAB 2, TRAPPED, PEOPLE BADLY HURT, IF YOU SEE THIS, CALL 911. What do you do? Consider how you would behave. Then consider what your employer could do better to make sure you ignore such messages. Then think of what kind of message would make you act on it anyway. In a fully general system, there's always some way for parts that come later to recontextualize the parts that came before. -- [0] - That's another argument in favor of anthropomorphising LLMs on a cognitive level. |
It's basically phishing with LLMs, isn't it?