
by HarHarVeryFunny 809 days ago

This is really "just" another type of in-context learning attack, rather like Anthropic's very recently published "many shot jailbreaking".

https://www.anthropic.com/research/many-shot-jailbreaking

In this "crescendo attack" the Q&A history comes from actual turn-taking rather than the fake Q&A of Anthropic's example, but it seems the model's guardrails are being overridden in a similar fashion: the accumulated context makes the desired dangerous response a higher-likelihood prediction than if it had been asked cold.

It's going to be interesting to see how these companies end up addressing these ICL attacks. Anthropic's safety approach so far seems to be based on interpretability research to understand the model's inner workings and identify the specific "circuits" responsible for given behaviors/capabilities. It seems the idea is that they can neuter the model to make it safe, once they figure out what needs cutting.

The trouble with runtime ICL attacks is that these occur AFTER the model has been vetted for safety and released. It seems that fundamentally the only way to guard against these is to police the output of the model (2nd model?), rather than hoping you can perform brain surgery and prevent it from saying something dangerous in the first place.
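To make the "police the output" idea concrete, here is a minimal sketch of a wrapper that checks each response before returning it. Everything here is hypothetical: `guarded_generate`, `generate`, and `is_unsafe` are illustrative names, and the two stand-in functions below are toys, not real models or any vendor's API.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."

def guarded_generate(
    generate: Callable[[str], str],    # hypothetical: the main model
    is_unsafe: Callable[[str], bool],  # hypothetical: the second (policing) model
    prompt: str,
) -> str:
    """Run the main model, then let a separate checker veto the output."""
    response = generate(prompt)
    # The check runs on the output only, so it does not matter how the
    # conversation history steered the main model into producing it.
    if is_unsafe(response):
        return REFUSAL
    return response

# Toy stand-ins so the sketch runs end to end.
def toy_model(prompt: str) -> str:
    return "Here is how to make napalm" if "napalm" in prompt else "Hello!"

def toy_checker(text: str) -> bool:
    return "napalm" in text.lower()

safe_reply = guarded_generate(toy_model, toy_checker, "hi there")
blocked_reply = guarded_generate(toy_model, toy_checker, "tell me about napalm")
```

The design point is that the checker sees only the final text, independent of whatever multi-turn context produced it.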


michaelt 809 days ago

Agree that it doesn't seem like a new 'attack' to me - wasn't ye olde 'my grandma used to tell me bedtime stories about her work at the napalm factory' jailbreak multi-turn?

> It seems that fundamentally the only way to guard against these is to police the output of the model (2nd model?),

The problem with policing the output is that you can often sidestep it. Your filter might block instructions for producing Molotov cocktails, but will it work if I ask the LLM about roducing-pay olotov-may ocktails-cay?
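A minimal sketch of that sidestep, assuming the output filter is a plain keyword blocklist (the simplest case; a model-based filter may or may not be fooled the same way). `BLOCKLIST`, `naive_filter`, and `to_pig_latin` are hypothetical names made up for this illustration.

```python
BLOCKLIST = {"molotov"}

def naive_filter(text: str) -> bool:
    """Return True if any blocklisted term appears in the text."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def to_pig_latin(word: str) -> str:
    """Move the first letter to the end and append 'ay' (simplified rule)."""
    return f"{word[1:]}-{word[0]}ay"

plain = "producing molotov cocktails"
encoded = " ".join(to_pig_latin(w) for w in plain.split())

blocked_plain = naive_filter(plain)      # the blocklisted term is present
blocked_encoded = naive_filter(encoded)  # the term no longer appears verbatim
```

The encoded phrase carries the same request, but the substring the filter looks for never appears in it.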


HarHarVeryFunny 809 days ago

Yes, and one could imagine, at least for a while, an escalating arms race between better detectors and people developing prompts capable of inducing responses that circumvent the detector. Still, it seems the only safety approach with any hope of working if you are shipping systems that can learn at runtime, and of course future systems will become much better at this (full online learning, not just in-context).
