by thegeomaster
308 days ago
The SWE-Bench Verified score, with thinking, ties Opus 4.1 without thinking. The AIME scores do not look too impressive at first glance. They are downplaying benchmarks heavily in the live stream, and this is the lab that has been flexing benchmarks as headline figures since forever. This is a product-focused update: there is no significant jump in raw intelligence or agentic behavior against SOTA.