| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by leerob 92 days ago
	Are there other coding benchmarks we should include next time? We included Teminal-Bench 2.0 and SWE-bench Mulitilingual. We don't plan on reporting SWE-bench Verified, for similar reasons to OpenAI: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...