Hacker News
Claude Fable 5 jailbroken to bypass Anthropic's new safety guardrails (twitter.com)
8 points by bukati 7 days ago
1 comment

Just saw Pliny (@elder_plinius) drop this. He jailbroke the model pretty effectively using a mix of tricks: breaking harmful requests into harmless-looking pieces and reassembling them, narrative/academic framing, long-context shenanigans, weird text transforms, and out-of-distribution tokens. Pretty interesting look at how well (or not) these new output-side guardrails actually hold up against a determined multi-step attack.