| HN Mirror

	by Tiberium 191 days ago
	They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275 https://i.imgur.com/e0iB8KC.png

3 comments

enlyth 191 days ago

This looks cherry-picked. For example, Claude Opus had a higher score on SWE-Bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.


tobias2014 190 days ago

And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
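To put a rough number on that margin of error, here is a minimal sketch that treats each benchmark score as a binomial proportion and runs a simple two-proportion z-test. The item count `n = 500` is a hypothetical value chosen for illustration; the thread does not say which benchmark the 91.9% and 92.4% figures come from or how many items it has. The helper names are made up for this example as well.

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy p measured on n items."""
    return z * math.sqrt(p * (1.0 - p) / n)

def two_proportion_z(p1: float, p2: float, n: int) -> float:
    """z statistic for the gap between two accuracies, each measured on n items.

    Assumes the two scores are independent samples, which is a simplification:
    both models usually see the same items, so a paired test would be more apt.
    """
    se = math.sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n)
    return (p1 - p2) / se

n = 500  # hypothetical benchmark size, not taken from the thread
moe = margin_of_error(0.924, n)
z = two_proportion_z(0.924, 0.919, n)

print(f"margin of error on 92.4%: +/- {moe * 100:.1f} points")
print(f"z statistic for 92.4% vs 91.9%: {z:.2f}")
print("significant at 95%" if abs(z) > 1.96 else "not significant at 95%")
```

Under these assumptions the margin of error on a single score is around 2.3 points and the z statistic is about 0.29, well below the 1.96 threshold, so a 0.5-point gap would not be distinguishable from noise. A larger benchmark or a paired analysis could change that picture.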


minadotcom 190 days ago

agreed.


sergdigon 190 days ago

The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).


whimsicalism 190 days ago

uh oh, where did SWE bench go :D


whimsicalism 190 days ago

maybe they will release with gpt-5.2-codex
