by 2001zhaozhao 15 days ago
You know that it's an honest benchmark when their own model (SWE-1.6) scores terribly on it.