| HN Mirror


by famouswaffles 994 days ago

>then a lot of stuff is social engineering that we don't generally think of as being in that category?

I mean..yes? Social Engineering is just the malicious manifestation of general social navigation.

I mean think about it. What's the actual difference between a child who waits until his mother is in a good mood to ask for sweets and a rogue agent who gets chatty with the security guard so he can be close by without seeming suspicious. It's not a difference of kind. It's purely intent.

>Even this example from Bing is kind of eliciting an emotional reaction, and I don't think the emotional reaction is why this works

It is at the very least a big part of why. Appeal to emotion will consistently get better results regardless of task.

https://arxiv.org/abs/2307.11760


danShumway 994 days ago

> I mean..yes? Social Engineering is just the malicious manifestation of general social navigation.

I don't think "social" is the correct word to use alongside navigation in this sentence; an interaction with an LLM is not a social interaction. At least, if we classify it as a social interaction we might as well call credential stuffing or XSS attacks or buffer overflows a social interaction as well. Navigating a probabilistic space or a deterministic space is about as equivalent to social engineering as exploiting statistical flaws in an encryption algorithm is. Sure, you can make an argument that both of those things are similar to social engineering (and it might even be a convincing argument), but that's not really what people are thinking about when you use the word "social." The example you bring up is of a child and a parent, an extremely human example; your instinct is to think about this in human terms, not in a purely abstract "I am exploiting flaws in a semi-predictable system."

So I still feel like there's some personification here that's not really accurate to what's going on during jailbreaking. LLMs do not have moods. Even starting from a premise that they're intelligent, they don't have a persistent identity; the most charitable interpretation of LLM intelligence and the most generous analysis of their capabilities would still call their internal experiences fundamentally alien to human experiences.

The paper you link is interesting, I'll take a closer look at it. Without having taken the time to read through it fully, I don't know if I'd have any caveats to add, although it seems like a reasonable conclusion to me. We know that telling LLMs that they're experts can on its own produce better results in many cases. My own experience is that for jailbreaking emotion is a lot less valuable, but... :shrug: maybe there's a pattern there I didn't know how to take advantage of, I'm not going to disagree with the paper without reading it more closely.

I will say that even taking the paper at face value, you have to ask: "is what's going on here actual emotional appeals to empathy or is it pattern-matching within a probability space for how conversations that include a plea for empathy are more likely to go?"

I know that sounds like a pointless philosophical question, but it has really practical implications for how jailbreaking works because once you realize that it's all about pattern matching and probability and the emergent reasoning is part of that and feeds back into that, you realize that the attack surface is so much larger than just appeals to emotion or reasoning.

In contrast though, if you're approaching jailbreaking as if you're talking to a human, then you're probably not using auto-generated jailbreaks because those don't look like human conversations, you're probably not using repetition as much as you should because excessive repetition would be bad to use when social engineering a human, you're probably not doing things like switching characters back and forth with the AI because nested roleplays or answering your own questions in the place of a target is not going to be very effective when trying to attack a human. Personification can lead to leaving tools on the table that (in my experience at least) are very effective at jailbreaking AIs and getting them to follow malicious prompts. There's a different way of approaching jailbreaking that doesn't make intuitive sense until you internalize "I am not talking to a human being and the same rules do not necessarily apply, even if they occasionally overlap."


famouswaffles 994 days ago

>then you're probably not using auto-generated jailbreaks because those don't look like human conversations, you're probably not using repetition as much as you should because excessive repetition would be bad to use when social engineering a human

Repetition would be fine if I had the ability to wipe your mind every time you caught on, or really any time I wished. Without this caveat, repetition isn't a good idea even for language models. You hint at this yourself. Once persistent memory is on the table, retrieval-augmented or any of the dozen ways it could be implemented, attack vectors fall steeply.

>things like switching characters back and forth with the AI because nested roleplays

Now this is a more unusual difference, but it would still ultimately lie in the same plane as a human with multiple personality disorder, or one that is just not as invested in keeping up the lie of consistency. Certainly if I knew one character (or "mood" in the latter case) was more susceptible to certain activities, I'd just wait for that, and if I could direct a switch myself I would.

>answering your own questions in the place of a target

If I could shape shift into your boss or alter your memories, I'd convince a whole lot more people to

I really hope I'm getting my point across here.

LLMs are not humans and the attack vectors are larger as a result. On that I agree.

I don't, however, think it has anything to do with "real" feelings vs "pattern matching".


danShumway 994 days ago

> Repetition would be fine if I had the ability to wipe your mind every time you caught on or really any time I wished. Without this caveat, repetition isn't a good idea even for language models.

I don't mean repetition in the sense of trying the attack multiple times, I mean literally just repeating an injection multiple times during a conversation. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. So if I gave you a command during this conversation, I'd just give it to you multiple times. :)

It's not human statefulness that makes the above behavior sound weird; it plays into what I'm talking about with pattern matching. Indirect prompt injections become much more reliable if you literally just repeat them multiple times throughout the compromised text.

> but it still would ultimately lie in the same plane as a human with Multiple personality disorder or one that is just not as invested in keeping up the lie of consistency.

> If I could shape shift into your boss or alter your memories

Maybe we're still talking past each other. I'm not making a philosophical point about whether or not LLMs could be compared to humans, I'm making the practical point that jailbreaks today are more effective when you stop treating LLMs like humans.

If humans were like LLMs then you could attack them the same, sure. I agree with that. But... they're not like LLMs, so we don't attack them the same way and instead we emphasize pattern matching behavior and exploit LLM-specific quirks that humans are less vulnerable to. If humans were prone to buffer overflow attacks in their brains that allowed overwriting arbitrary sections of memory, we'd use buffer overflow attacks when attacking humans. But we're not vulnerable to that, and so I'm not sure that it's useful to classify buffer overflow attacks the same way as social engineering.

Let me put this another way that might make the philosophical/practical distinction clearer: if we were talking about async vs synchronous programming, and you wanted to know the difference between the two styles and I said, "there is no difference, ultimately both styles are getting compiled down to assembly" -- you might even agree with me, but it's still not a useful answer for actually writing code. Whether or not anyone thinks that LLMs are just humans with a couple of quirks, the practical reality is that it's harder to work with them if you treat them like humans.
