by jameson 68 days ago
How should one compare benchmark results?

For example, the SWE-bench Pro score improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?