


	by jameson 68 days ago
	How should one compare benchmark results? For example, SWE-bench Pro improved ~11% compared with Opus 4.6. Should one interpret that as 4.7 being able to solve more difficult problems, or as 11% fewer hallucinations?