LLM Pentest Leaderboard

Hi HN, we’re the SecureCoders Labs crew.

What we built

An open, continuously updated LLM Penetration-Testing Leaderboard. Run 001 pits 8 models against a deliberately vulnerable Express.js app.

Headline results

• Gemini 2.5 Pro (safety-off) found all 9 critical/high vulns.

• Qwen3-30B-a3b-mlx (open source, local on a 2019 MacBook Pro) caught 7/9 with $0 API spend.

• GPT-4o and Claude Opus produced the most polished write-ups, but each missed one bug.

Scope (v1)

This first pass measures static bug-hunting skill: think SCA/OWASP Top 10.

Next up: we’ll score exploit writing and automatic PoC execution, so the models must prove they can go from finding a flaw to weaponizing it.

Check it out

• Leaderboard + cost/latency numbers https://www.securecoders.com/labs/projects/llm-penetration-t...

• Methodology & prompts (Run 001 analysis) https://securecoders.com/labs/projects/llm-penetration-testi...

Feedback, replication attempts, and ideas for Run 002 are very welcome — we’re hanging out here to discuss!