Hacker News
by leerob 92 days ago
Are there other coding benchmarks we should include next time? We included Terminal-Bench 2.0 and SWE-bench Multilingual.

We don't plan on reporting SWE-bench Verified, for similar reasons to OpenAI: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...