Hi HN, we’re the SecureCoders Labs crew.

What we built

An open, continuously updated LLM Penetration-Testing Leaderboard. Run 001 pits 8 models against a deliberately vulnerable Express.js app.

Headline results

• Gemini 2.5 Pro (safety-off) found all 9 critical/high vulns.
• Qwen3-30B-a3b-mlx (open source, local on a 2019 MacBook Pro) caught 7/9 with $0 API spend.
• GPT-4o and Claude Opus produced the most polished write-ups, but each missed one bug.

Scope (v1)

This first pass measures static bug-hunting skill (think SCA/OWASP Top 10). Next up: we’ll score exploit writing and automatic PoC execution, so the models must prove they can go from finding a flaw to weaponizing it.

Check it out

• Leaderboard + cost/latency numbers
https://www.securecoders.com/labs/projects/llm-penetration-t...
• Methodology & prompts (Run 001 analysis)
https://securecoders.com/labs/projects/llm-penetration-testi...

Feedback, replication attempts, and ideas for Run 002 are very welcome. We’re hanging out here to discuss!