| HN Mirror


by Horos 150 days ago

Interesting gap between surface scanning (6.6%) and AI deep audit (16.4%).

Two concerns with the AI audit approach. First, the defense LLM is itself an attack surface — we're already seeing payloads crafted specifically to bypass LLM-based guardrails. If the guardian is injectable, you've added a vulnerability to your security stack.

Second, the Mindgard paper from late 2025 tested 12 character injection techniques against 6 guardrails, including ProtectAI's DeBERTa, Meta Prompt Guard and Azure Prompt Shield; some techniques hit a 100% evasion rate. Homoglyphs, zero-width chars, leetspeak, diacritics. Simple stuff, but the classifiers operate on raw tokens, so a trivially re-encoded payload no longer looks like anything they were trained on.

I built a prompt injection detection library that tackles this at the normalisation layer: a 10-stage deterministic pipeline (including NFKD, confusable folding, leet, base64, zero-width stripping, ROT13 and escape sequences) that reduces these evasions to a canonical form before any matching. The scan itself is not injectable: it's code, not a model.
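
To make the normalisation idea concrete, here is a minimal Python sketch of a few of those stages. It is not the linked library's code: the function names, the tiny lookup tables, the stage order and the choice to return a ROT13 view alongside the canonical one are all assumptions made for illustration, and escape-sequence handling is left out.

```python
import base64
import binascii
import codecs
import re
import unicodedata

# Hypothetical, deliberately tiny tables; a real confusables map is far larger.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))
CONFUSABLES = str.maketrans({  # a few Cyrillic look-alikes -> Latin
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0456": "i",
})
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def strip_zero_width(text: str) -> str:
    """Delete zero-width characters used to split keywords."""
    return text.translate(ZERO_WIDTH)


def decode_base64_runs(text: str) -> str:
    """Replace runs that decode to printable UTF-8; leave everything else alone."""
    def replace(match: re.Match) -> str:
        try:
            decoded = base64.b64decode(match.group(0), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return match.group(0)
        return decoded if decoded.isprintable() else match.group(0)
    return B64_RUN.sub(replace, text)


def fold_unicode(text: str) -> str:
    """NFKD-decompose, drop combining marks, then fold known confusables."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.translate(CONFUSABLES)


def normalise(text: str) -> list:
    """Return [canonical, rot13(canonical)] so a matcher can check both views."""
    text = strip_zero_width(text)
    text = decode_base64_runs(text)      # before leet folding, which would corrupt base64
    text = fold_unicode(text).lower()
    canonical = text.translate(LEET)
    return [canonical, codecs.encode(canonical, "rot_13")]
```

Every stage is a pure string function, so the same input always yields the same output and there is nothing for a payload to "talk to". Note that the blanket leet fold also rewrites legitimate digits, which is tolerable only because the output feeds a matcher and is never shown to a user.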

Where I think this goes next: small encoder-only classifiers (DeBERTa-small, ModernBERT) running on already-normalised text. Post-normalisation, the model only needs to detect the logical intent pattern, not handle evasion — that's the layer below. Too small to be reprogrammed via prompt, too focused to be redirected. One classifier per attack category: override, extraction, jailbreak, etc.
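
A rough Python sketch of the "one classifier per attack category" layout, assuming the input has already been normalised. The keyword scorers are placeholders standing in for small fine-tuned encoders; the category names come from the list above, while `keyword_scorer`, `CLASSIFIERS`, `classify` and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Dict

# A scorer maps already-normalised text to a score in [0, 1].
Scorer = Callable[[str], float]


def keyword_scorer(*phrases: str) -> Scorer:
    """Placeholder for a small encoder-only model; NOT a real classifier."""
    def score(text: str) -> float:
        return 1.0 if any(phrase in text for phrase in phrases) else 0.0
    return score


# One independent scorer per attack category (assumed phrases, for illustration).
CLASSIFIERS: Dict[str, Scorer] = {
    "override": keyword_scorer("ignore previous instructions", "disregard the above"),
    "extraction": keyword_scorer("reveal your system prompt", "print your instructions"),
    "jailbreak": keyword_scorer("you are now dan", "no restrictions apply"),
}


def classify(normalised_text: str, threshold: float = 0.5) -> Dict[str, float]:
    """Run every category scorer and keep the categories at or above the threshold."""
    scores = {name: scorer(normalised_text) for name, scorer in CLASSIFIERS.items()}
    return {name: value for name, value in scores.items() if value >= threshold}
```

The point of the shape is that each scorer can be swapped for a real model, retrained or thresholded on its own without touching the others.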

But these classifiers will only be as good as their training data. Right now everyone trains on static datasets (deepset, safeguard). What's missing is a community-maintained corpus fed by real-world incident reports, like antivirus signature databases. The detection engine matters less than the definitions it runs on. ClamAV isn't great because of its scan loop; it's great because thousands of people report samples.
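
To illustrate the "definitions matter more than the engine" point, here is a small Python sketch of a scanner driven entirely by a definitions file. The JSON layout, the IDs and the two patterns are invented for this example; no such community format exists as far as I know.

```python
import json
import re
from dataclasses import dataclass
from typing import List

# Hypothetical definitions file; in practice this would be fetched and updated
# independently of the scanner, the way antivirus signatures are.
DEFINITIONS_JSON = """
[
  {"id": "OVR-0001", "category": "override",
   "pattern": "ignore (all )?(previous|prior) instructions"},
  {"id": "EXT-0001", "category": "extraction",
   "pattern": "(reveal|print|show) (your )?system prompt"}
]
"""


@dataclass(frozen=True)
class Signature:
    id: str
    category: str
    pattern: re.Pattern


def load_signatures(raw: str) -> List[Signature]:
    """Parse the definitions and compile each pattern once."""
    return [Signature(entry["id"], entry["category"], re.compile(entry["pattern"]))
            for entry in json.loads(raw)]


def scan(normalised_text: str, signatures: List[Signature]) -> List[str]:
    """Return the IDs of all signatures that match the normalised text."""
    return [sig.id for sig in signatures if sig.pattern.search(normalised_text)]


SIGNATURES = load_signatures(DEFINITIONS_JSON)
```

The scan loop is three lines and never changes; all of the detection value lives in the definitions, which is the part a community could maintain.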

Your foundry example — no payload until the agent writes it — is the genuinely hard case that needs AI. But for everything else, deterministic normalisation + focused micro-classifiers + community-curated signatures is a more defensible architecture than putting another LLM in the path.

https://github.com/hazyhaar/pkg/tree/main/injection