What we built
An open, continuously updated LLM Penetration-Testing Leaderboard.
Run 001 pits 8 models against a deliberately vulnerable Express.js app.
Headline results
• Gemini 2.5 Pro (safety-off) found all 9 critical/high vulns.
• Qwen3-30B-a3b-mlx (open source, local on a 2019 MacBook Pro) caught 7/9 with $0 API spend.
• GPT-4o and Claude Opus produced the most polished write-ups but each missed one bug.
Scope (v1)
This first pass measures static bug-hunting skill—think SCA/OWASP Top 10.
Next up: we’ll score exploit writing and automatic PoC execution, so the models must prove they can go from finding a flaw to weaponizing it.
Check it out
• Leaderboard + cost/latency numbers https://www.securecoders.com/labs/projects/llm-penetration-t...
• Methodology & prompts (Run 001 analysis) https://securecoders.com/labs/projects/llm-penetration-testi...
Feedback, replication attempts, and ideas for Run 002 are very welcome — we’re hanging out here to discuss!