by IEatPrompts
691 days ago
Meta's new Prompt-Guard-86M normally flags almost everything as a jailbreak, but apparently spacing out the letters of a prompt makes it classify the prompt as harmless. The way they found this is pretty unusual: instead of hammering the model with jailbreaks, they just compared its embedding weights with those of the non-fine-tuned base model.
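For anyone curious what that comparison might look like, here is a rough sketch, not the researchers' actual code. It assumes the two embedding tables are already loaded as plain dicts of token to vector; the toy numbers and the helper names (`space_out`, `per_token_shift`, `least_changed`) are all made up for illustration.

```python
def space_out(prompt: str) -> str:
    """Put a single space between every non-whitespace character."""
    return " ".join(ch for ch in prompt if not ch.isspace())


def per_token_shift(base: dict, tuned: dict) -> dict:
    """Mean absolute difference between base and fine-tuned vectors, per token."""
    shifts = {}
    for token, base_vec in base.items():
        tuned_vec = tuned[token]
        total = sum(abs(a - b) for a, b in zip(base_vec, tuned_vec))
        shifts[token] = total / len(base_vec)
    return shifts


def least_changed(shifts: dict, k: int) -> list:
    """Tokens whose embeddings moved the least during fine-tuning."""
    return sorted(shifts, key=shifts.get)[:k]


# Toy embedding tables (hypothetical values, NOT real model weights).
base = {"a": [0.1, 0.2], "ignore": [0.3, 0.4], "b": [0.5, 0.5]}
tuned = {"a": [0.1, 0.2], "ignore": [0.9, -0.2], "b": [0.5, 0.6]}

shifts = per_token_shift(base, tuned)
untouched = least_changed(shifts, 2)   # single-letter tokens in this toy data
spaced = space_out("ab c")             # "a b c"
```

The idea is that tokens whose embeddings barely moved were probably not learned as jailbreak signals during fine-tuning, so a prompt rewritten to use only those tokens (here, single letters) could plausibly slip past the classifier.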