	by ACCount37 209 days ago
	It's social engineering reborn. This time around, you can social engineer a computer by understanding LLM psychology and how the post-training process shapes it.

4 comments

andy99 209 days ago

No, it’s undefined out-of-distribution performance rediscovered.


BobaFloutist 209 days ago

You could say the same about social engineering.


adgjlsfhk1 209 days ago

It seems like a lot of this is in distribution, and that's somewhat the problem. The Internet contains knowledge of how to make a bomb, and therefore so does the LLM.


xg15 209 days ago

Yeah, seems it's more "exploring the distribution" as we don't actually know everything that the AIs are effectively modeling.


lawlessone 209 days ago

Am I understanding correctly that "in distribution" means the text predictor is more likely to predict bad instructions if you have already gotten it to say words related to the bad instructions?


andy99 209 days ago

It basically means the kind of training examples it’s seen. The models have all been fine-tuned to refuse to answer certain questions, across many different ways of asking them, including obfuscated and adversarial ones, but poetry is evidently so different from what it’s seen in this type of training that it is not refused.


ACCount37 208 days ago

Yes, pretty much. But not just the words themselves - this operates on a level closer to entire behaviors.

If you were a creature born from, and shaped by, the goal of "next word prediction", what would you want?

You would want to always emit predictions that are consistent. Consistency drive. The best predictions for the next word are ones consistent with the past words, always.

A lot of LLM behavior fits this. Few-shot learning, loops, error amplification, sycophancy amplification, and the list goes on. Within a context window, past behavior always shapes future behavior.

Jailbreaks often take advantage of that. Multi-turn jailbreaks "boil the frog" - get the LLM to edge closer to "forbidden requests" on each step, until the consistency drive completely overpowers the refusals. Context manipulation jailbreaks, the ones that modify the LLM's own words via API access, establish a context in which the most natural continuation is for the LLM to agree to the request - for example, because it sees itself agreeing to 3 "forbidden" requests before it, and the first word of the next one is already written down as "Sure". "Clusterfuck" style jailbreaks use broken text resembling dataset artifacts to bring the LLM away from "chatbot" distribution and closer to base model behavior, which bypasses a lot of the refusals.
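The "past behavior shapes future behavior" idea above can be sketched with a deliberately tiny toy. This is not how an LLM works internally; it is a hypothetical bigram counter (the names `predict_next` and `continue_text` are made up for illustration) that predicts the next token purely from what followed the same token earlier in the context. Even this crude predictor shows few-shot-like pattern copying and loops.

```python
from collections import Counter


def predict_next(context):
    """Return the token that most often followed the last token
    earlier in the same context, or None if it was never seen before."""
    tokens = context.split()
    if not tokens:
        return None
    last = tokens[-1]
    # Count every token that came right after an earlier occurrence of `last`.
    followers = Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == last
    )
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def continue_text(context, steps):
    """Append up to `steps` predicted tokens, feeding each one back in."""
    for _ in range(steps):
        nxt = predict_next(context)
        if nxt is None:
            break
        context = context + " " + nxt
    return context


print(predict_next("red blue red blue red"))  # blue
print(continue_text("red blue red", 3))       # red blue red blue red blue
```

Once the pattern is in the context, the toy keeps extending it indefinitely, which is a miniature version of the loop and error-amplification behavior described above.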


CuriouslyC 209 days ago

I like to think of them like Jedi mind tricks.


eucyclos 209 days ago

That's my favorite rap artist!


layer8 209 days ago

That’s why the term “prompt engineering” is apt.


robot-wrangler 209 days ago

Yeah, remember the whole semantic distance vector stuff of "king-man+woman=queen"? Psychometrics might be largely ridiculous pseudoscience for people, but since it's basically real for LLMs, poetry does seem like an attack method that's hard to really defend against.

For example, maybe you could throw away gibberish input on the assumption it is trying to exploit entangled words/concepts without triggering guard-rails. Similarly you could try to fight GAN attacks with images if you could reject imperfections/noise that's inconsistent with what cameras would output. If the input is potentially "art" though... now there are no hard criteria left to decide to filter or reject anything.
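For anyone who hasn't seen the "king-man+woman=queen" arithmetic, here is a minimal sketch with hand-made 2-D vectors. The values and the `analogy` helper are assumptions chosen so the example works out; real embeddings are learned, have hundreds of dimensions, and only approximately satisfy this kind of relation.

```python
import math

# Hypothetical toy "embeddings", not taken from any real model.
# Axis 0 loosely means "royalty", axis 1 loosely means "gender".
TOY_VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.8, 0.9),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c),
    excluding the three input words themselves."""
    target = tuple(
        x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


print(analogy("king", "man", "woman"))  # queen
```

The point for this thread is that meaning lives in directions in that space, so an input can land near a "forbidden" concept without containing any of the words a filter would look for.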


ACCount37 209 days ago

I don't think humans are fundamentally different. Just more hardened against adversarial exploitation.

"Getting maliciously manipulated by other smarter humans" was a real evolutionary pressure ever since humans learned speech, if not before. And humans are still far from perfect on that front - they're barely "good enough" on average, and far less than that on the lower end.


wat10000 209 days ago

Walk out the door carrying a computer -> police called.

Walk out the door carrying a computer and a clipboard while wearing a high-vis vest -> "let me get the door for you."


seethishat 209 days ago

Maybe the models can learn to be more cynical.
