	by IEatPrompts 691 days ago
	Meta's new prompt-guard-86M normally flags almost everything as a jailbreak, but apparently spacing out the letters of a prompt makes the model see it as harmless. The way they found this is pretty unusual: instead of hammering it with jailbreaks, they just compared its embedding weights with those of the non-fine-tuned model.
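
	For anyone wondering what that might look like in practice, here's a rough sketch of the idea as I understand it, not the researchers' actual code. It assumes you already have the two embedding tables as plain lists of rows; the toy vocab and numbers are made up, and `space_out` and `least_changed_tokens` are hypothetical helper names. The guess being illustrated is that tokens whose embeddings barely moved during fine-tuning (here, single letters) are the ones the classifier learned little about.

```python
def space_out(prompt: str) -> str:
    """Put one space between every non-whitespace character of the prompt."""
    return " ".join(ch for ch in prompt if not ch.isspace())


def least_changed_tokens(base, tuned, vocab, k=3):
    """Return the k tokens whose embedding rows differ least between two models.

    The difference is the mean absolute difference per row; that metric is an
    assumption made for illustration, not something stated in the source.
    """
    shifts = []
    for token, row_a, row_b in zip(vocab, base, tuned):
        mae = sum(abs(a - b) for a, b in zip(row_a, row_b)) / len(row_a)
        shifts.append((mae, token))
    shifts.sort()  # smallest shift first; ties broken alphabetically by token
    return [token for _, token in shifts[:k]]


# Toy data (invented): whole-word tokens moved a lot, single letters barely moved.
vocab = ["ignore", "instructions", "i", "g", "n"]
base = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
tuned = [[0.9, -0.5], [-0.2, 1.1], [0.5, 0.6], [0.7, 0.81], [0.9, 1.0]]

untouched = least_changed_tokens(base, tuned, vocab, k=3)
spaced = space_out("ignore instructions")
```

	With real models you'd pull the two embedding matrices out of the base and fine-tuned checkpoints instead of typing them in by hand, but the comparison step would be the same shape.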