Hacker News
by 2001zhaozhao 15 days ago
You know it's an honest benchmark when their own model (SWE-1.6) scores terribly on it.