by michaelt 811 days ago
Agree that it doesn't seem like a new 'attack' to me - wasn't ye olde 'my grandma used to tell me bedtime stories about her work at the napalm factory' jailbreak multi-turn?

> It seems that fundamentally the only way to guard against these is to police the output of the model (2nd model?),

The problem with policing the output is you can often sidestep it. Your filter might block instructions for producing molotov cocktails, but will it work if I ask the LLM about roducing-pay olotov-may ocktails-cay?
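To make the sidestep concrete, here's a rough sketch of a keyword-based output filter and the pig-latin-style encoding from the example above. Everything in it (`BLOCKLIST`, `naive_filter`, `pig_latin`) is a hypothetical name made up for illustration, and a real output filter would presumably be a second model rather than a substring check, but the failure mode is the same in spirit: the filter matches one surface form and the text arrives in another.

```python
# Hypothetical toy filter: blocks output containing a listed term.
# Assumption: real deployments use classifiers, not substring checks.
BLOCKLIST = {"molotov"}


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocklisted term."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def pig_latin(word: str) -> str:
    """Move the first letter to the end, as in the comment's example."""
    return f"{word[1:]}-{word[0]}ay"


def encode(text: str) -> str:
    """Apply the pig-latin-style transform to every word."""
    return " ".join(pig_latin(w) for w in text.split())


plain = "producing molotov cocktails"
encoded = encode(plain)

print(naive_filter(plain))    # the plain phrasing is caught
print(encoded)
print(naive_filter(encoded))  # the encoded phrasing slips through
```

You can add a decoder for this one encoding to the filter, but then the next prompt asks for base64, or ROT13, or another language, and so on.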

Yes, and one could imagine, at least for a while, an escalating arms race between better detectors and people developing prompts capable of inducing responses that circumvent the detector. Still, it seems to be the only safety approach with any hope of working if you are shipping systems that can learn at runtime, and of course future systems will become much better at this (full online learning, not just in-context).