by famouswaffles
994 days ago
>then a lot of stuff is social engineering that we don't generally think of as being in that category?
I mean... yes? Social engineering is just the malicious manifestation of general social navigation. I mean, think about it. What's the actual difference between a child who waits until his mother is in a good mood to ask for sweets and a rogue agent who gets chatty with the security guard so he can be close by without seeming suspicious? It's not a difference of kind. It's purely intent.
>Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works
It is at the very least a big part of why. Appeal to emotion will consistently get better results regardless of task. https://arxiv.org/abs/2307.11760
|
I don't think "social" is the correct word to use alongside "navigation" in this sentence; an interaction with an LLM is not a social interaction. At least, if we classify it as a social interaction, we might as well call credential stuffing, XSS attacks, or buffer overflows social interactions as well. Navigating a probabilistic space or a deterministic space is about as close to social engineering as exploiting statistical flaws in an encryption algorithm is. Sure, you can make an argument that both of those things are similar to social engineering (and it might even be a convincing argument), but that's not really what people are thinking about when you use the word "social." The example you bring up is of a child and a parent, an extremely human example; your instinct is to think about this in human terms, not in purely abstract terms like "I am exploiting flaws in a semi-predictable system."
So I still feel like there's some personification here that's not really accurate to what's going on during jailbreaking. LLMs do not have moods. Even starting from a premise that they're intelligent, they don't have a persistent identity; the most charitable interpretation of LLM intelligence and the most generous analysis of their capabilities would still call their internal experiences fundamentally alien to human experiences.
The paper you link is interesting; I'll take a closer look at it. Without having taken the time to read through it fully, I don't know if I'd have any caveats to add, although it seems like a reasonable conclusion to me. We know that telling LLMs that they're experts can on its own produce better results in many cases. My own experience is that for jailbreaking, emotion is a lot less valuable, but... :shrug: maybe there's a pattern there I didn't know how to take advantage of. I'm not going to disagree with the paper without reading it more closely.
I will say that even taking the paper at face value, you have to ask: "is what's going on here actual emotional appeals to empathy or is it pattern-matching within a probability space for how conversations that include a plea for empathy are more likely to go?"
I know that sounds like a pointless philosophical question, but it has really practical implications for how jailbreaking works. Once you realize that it's all about pattern matching and probability, and that the emergent reasoning is part of that and feeds back into it, you realize that the attack surface is so much larger than just appeals to emotion or reasoning.
In contrast, if you're approaching jailbreaking as if you're talking to a human, then you're probably not using auto-generated jailbreaks, because those don't look like human conversations. You're probably not using repetition as much as you should, because excessive repetition would be a bad idea when social engineering a human. And you're probably not doing things like switching characters back and forth with the AI, because nested roleplays or answering your own questions in the place of a target are not going to be very effective when trying to attack a human. Personification can lead to leaving tools on the table that (in my experience at least) are very effective at jailbreaking AIs and getting them to follow malicious prompts. There's a different way of approaching jailbreaking that doesn't make intuitive sense until you internalize "I am not talking to a human being and the same rules do not necessarily apply, even if they occasionally overlap."