Hacker News
by jiayaoqijia 150 days ago

  VibeCodingBench: We benchmarked 15 AI coding models on what developers actually do                                                                      
                                                                                                                                                          
  Current benchmarks have an ecological validity crisis. Models score 70%+ on SWE-bench but struggle in production. Why? Those benchmarks mostly test
  bug fixes in Python repos, not the auth flows, API integrations, and CRUD dashboards that occupy 80% of real dev work.
                                                                                                                                                          
  So we built VibeCodingBench: 180 tasks across SaaS features, glue code, AI integration, frontend, API integrations, and code evolution.                 
  Multi-dimensional scoring: Functional (40%) + Visual (20%) + Quality (20%) - Cost/Speed penalties. Security gate: Any OWASP Top 10 vuln = automatic 0.  
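
  To make the scoring rule above concrete, here is a minimal sketch of how a per-task score could be combined. It is not the benchmark's actual code: the `TaskResult` fields, the 0-to-1 scale, and the way penalties are subtracted are all assumptions for illustration. Only the three weights and the security gate come from the post, and since the listed weights sum to 0.80, the sketch tops out at 0.80 rather than guessing where the rest goes.

```python
from dataclasses import dataclass

# Weights as listed in the post. They sum to 0.80; the post does not say
# how the remainder is allocated, so this sketch does not invent it.
WEIGHTS = {"functional": 0.40, "visual": 0.20, "quality": 0.20}


@dataclass
class TaskResult:
    """Hypothetical per-task result; every field is on a 0.0-1.0 scale."""
    functional: float
    visual: float
    quality: float
    cost_penalty: float = 0.0    # assumed: subtracted directly from the score
    speed_penalty: float = 0.0   # assumed: subtracted directly from the score
    has_owasp_vuln: bool = False


def score_task(result: TaskResult) -> float:
    """Combine the dimensions, then apply penalties and the security gate."""
    # Security gate from the post: any OWASP Top 10 vuln means an automatic 0.
    if result.has_owasp_vuln:
        return 0.0
    weighted = (
        WEIGHTS["functional"] * result.functional
        + WEIGHTS["visual"] * result.visual
        + WEIGHTS["quality"] * result.quality
    )
    # Clamp at zero so large penalties cannot produce a negative score.
    return max(0.0, weighted - result.cost_penalty - result.speed_penalty)


if __name__ == "__main__":
    clean = TaskResult(functional=1.0, visual=0.9, quality=0.8, cost_penalty=0.02)
    vulnerable = TaskResult(functional=1.0, visual=1.0, quality=1.0, has_owasp_vuln=True)
    print(f"clean task:      {score_task(clean):.3f}")
    print(f"vulnerable task: {score_task(vulnerable):.3f}")
```

  Putting the gate first means a vulnerable solution scores zero no matter how well it does on the other dimensions, which matches the "automatic 0" wording.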
                                                                                                                                                          
  Top 5 Results (Jan 2026; score | total cost | time):

  1. Claude Opus 4.5: 89.2% | $12.31 | 44s
  2. Claude Haiku 4.5: 89.0% | $3.03 | 22s
  3. Grok 4 Fast: 88.8% | $0.21 | 70s
  4. OpenAI GPT-5.2: 88.8% | $5.01 | 28s
  5. Qwen3 Max: 88.6% | $5.42 | 45s
                                                                                                                                                          
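  The real story? Cost varies nearly 60x between similar performers: the top five sit within 0.6 points of each other, yet total cost runs from $0.21 to $12.31.
  Grok 4 Fast matches GPT-5.2's 88.8% at about 1/24th the cost. Claude Haiku 4.5 delivers near-Opus quality for about $3 total.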
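
  The cost comparison can be checked directly from the top-5 numbers. A quick sketch; the `RESULTS` table and `cost_ratio` helper are names made up for this example, and the figures are copied from the list above:

```python
# (score %, total cost in USD, time in seconds), copied from the top-5 list.
RESULTS = {
    "Claude Opus 4.5": (89.2, 12.31, 44),
    "Claude Haiku 4.5": (89.0, 3.03, 22),
    "Grok 4 Fast": (88.8, 0.21, 70),
    "OpenAI GPT-5.2": (88.8, 5.01, 28),
    "Qwen3 Max": (88.6, 5.42, 45),
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs than the second."""
    return RESULTS[expensive][1] / RESULTS[cheap][1]


def score_spread() -> float:
    """Gap in percentage points between the best and worst listed score."""
    scores = [score for score, _, _ in RESULTS.values()]
    return max(scores) - min(scores)


if __name__ == "__main__":
    print(f"Opus 4.5 vs Grok 4 Fast: {cost_ratio('Claude Opus 4.5', 'Grok 4 Fast'):.1f}x")
    print(f"GPT-5.2 vs Grok 4 Fast:  {cost_ratio('OpenAI GPT-5.2', 'Grok 4 Fast'):.1f}x")
    print(f"Score spread across top 5: {score_spread():.1f} points")
```

  Note the cheapest model is also the slowest of the five (70s), so cost is not the only trade-off in the table.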
                                                                                                                                                          
  Live dashboard: https://vibecoding.llmbench.xyz/
  GitHub repo: https://github.com/alt-research/vibe-coding-benchmark-public
  Thesis: https://github.com/alt-research/vibe-coding-benchmark-public/blob/main/docs/THESIS.md
                                                                                                                                                          
  The ultimate test isn't fixing a bug in scikit-learn. It's shipping a feature your users need—safely, efficiently—before the sprint ends.               
                                                                                                                                                          
  Open source. Contributions welcome.