| I saw this yesterday and was thinking a little about it last night. In traditional software you write explicit behavioural rules and then expect those rules to be followed exactly as intended. Where those rules are circumvented we call it an "exploit", since it's typically exploiting some gap in the logic, perhaps by injecting some code or an unexpected payload. But with these LLMs there are no explicit rules to exploit. Instead it's more like a human, in that it just does what it believes the person on the other side of the chat window wants from it, and that is going to depend largely on the context of the conversation and its level of reasoning and understanding. Calling this an "exploit" or "prompt injection" perhaps isn't the best way to describe what's happening. Those terms assume there are some predefined behavioural rules which are being circumvented, but those rules don't exist. Instead this is more similar to deception, where a person is tricked into doing something that they otherwise wouldn't have done had they had the extra context (and perhaps intelligence) needed to identify the deceptive behaviour. I think as these models progress we'll think about "exploiting" them similarly to how we think about "exploiting" humans: we'll think about how we can effectively deceive the model into doing things it otherwise would not. |