HN Mirror

by thegeomaster 356 days ago

SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking.

AIME scores do not appear too impressive at first glance.

They are downplaying benchmarks heavily in the livestream. This is the lab that has been flexing benchmarks as headline figures since forever.

This is a product-focused update. There is no significant jump in raw intelligence or agentic behavior against SOTA.

2 comments

what does it mean for a bench to be not impressive when it's saturated?

they aren't downplaying anything.