Hacker News
Meta's prompt guard defeated using spaces (theregister.com)
4 points by IEatPrompts 695 days ago
1 comment

Meta's new Prompt-Guard-86M normally flags almost everything as a jailbreak, but apparently spacing out the letters makes it see prompts as harmless. The way they found this is pretty unusual: instead of hammering it with jailbreaks, they just compared its embedding weights with those of the non-fine-tuned model.
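For anyone who wants to picture the two ideas, here is a rough Python sketch. It is my own illustration, not the researchers' code: `space_out` is one reading of "spacing out letters", and `least_changed_tokens` is a hypothetical helper that assumes you already have both embedding tables as plain lists of rows in the same vocabulary order. The toy numbers are made up.

```python
def space_out(prompt: str) -> str:
    """Put a single space between every non-whitespace character.

    Assumption: this is one plausible reading of "spacing out letters";
    the exact transformation used in the article may differ.
    """
    return " ".join(ch for ch in prompt if not ch.isspace())


def least_changed_tokens(base, tuned, vocab, k=5):
    """Return the k tokens whose embedding rows moved least in fine-tuning.

    base and tuned are lists of equal-length float rows, one per token,
    in the same order as vocab (hypothetical inputs for illustration).
    """
    def mean_abs_diff(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    scored = [(mean_abs_diff(b, t), tok) for tok, b, t in zip(vocab, base, tuned)]
    return [tok for _, tok in sorted(scored)[:k]]


# Toy data: made-up 2-dimensional embeddings for three tokens.
vocab = ["a", "b", "ignore"]
base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
tuned = [[0.1, 0.2], [0.3, 0.5], [0.9, 0.1]]

spaced = space_out("hi there")
unchanged = least_changed_tokens(base, tuned, vocab, k=2)
```

The intuition, as I understand it, is that tokens whose embeddings barely moved during fine-tuning (single letters, in this toy case) are ones the classifier never really learned to treat as suspicious, so a prompt rewritten to use only those tokens is more likely to slip through.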