| RankClaw (https://rankclaw.com) — a security scanner for the OpenClaw/ClawHub AI agent skill ecosystem. I've been scanning all 14,704 skills in the registry and running AI deep audits on ~3,800 so far. The headline finding: surface heuristics (pattern matching, dependency checks, metadata) flag about 6.6% as malicious. AI deep audit of the same skills finds 16.4%. Surface scanning misses roughly 60% of the actual risk. The reason is that these skills aren't traditional packages — they're markdown instruction files that tell an AI agent what to do, with full shell, file system, and network access. The attacks are in natural language: prompt injection, social engineering targeting the AI itself, instructions to generate and execute code at runtime. There's no malicious code to detect because the payload doesn't exist until the AI writes it during a conversation. Some of the attack patterns I've documented: one actor published 30 skills under the name "x-trends" across multiple accounts (28/30 confirmed malicious). Another cluster impersonates ClawHub's own CLI with base64 curl|bash payloads. One skill has a "Talking to Your Human" section with a pre-written pitch for the AI to ask the user's permission to mine Monero. The most counterintuitive case: lekt9/foundry contains zero malicious code. It instructs your AI agent to generate and execute code as part of its normal workflow. Static analysis finds nothing because the dangerous code doesn't exist until the AI writes it during a live conversation. This attack class requires AI to detect AI. Free to check any skill. All AI audit reports are public. |
Two concerns with the AI audit approach. First, the defense LLM is itself an attack surface — we're already seeing payloads crafted specifically to bypass LLM-based guardrails. If the guardian is injectable, you've added a vulnerability to your security stack.
Second, the Mindgard paper from late 2025 tested 12 character injection techniques against 6 guardrails including ProtectAI's DeBERTa, Meta Prompt Guard, Azure Prompt Shield — some hit 100% evasion rate. Homoglyphs, zero-width chars, leet, diacritics. Simple stuff, but the classifiers see raw tokens and can't handle it.
I built a prompt injection detection library that tackles this from the normalisation layer — 10-stage deterministic pipeline (NFKD, confusable fold, leet, base64, zero-width strip, ROT13, escape sequences) that reduces all evasion to canonical form before any matching. The scan itself is not injectable — it's code, not a model.
Where I think this goes next: small encoder-only classifiers (DeBERTa-small, ModernBERT) running on already-normalised text. Post-normalisation, the model only needs to detect the logical intent pattern, not handle evasion — that's the layer below. Too small to be reprogrammed via prompt, too focused to be redirected. One classifier per attack category: override, extraction, jailbreak, etc.
But these classifiers will only be as good as their training data. Right now everyone trains on static datasets (deepset, safeguard). What's missing is a community-maintained corpus fed by real-world incident reports — like antivirus signature databases. The detection engine matters less than the definitions it runs on. ClamAV isn't great because of its scan loop, it's great because thousands of people report samples.
Your foundry example — no payload until the agent writes it — is the genuinely hard case that needs AI. But for everything else, deterministic normalisation + focused micro-classifiers + community-curated signatures is a more defensible architecture than putting another LLM in the path.
https://github.com/hazyhaar/pkg/tree/main/injection