

by tl2do 111 days ago

In my day-to-day coding work, the top 3 coding agents are already good enough for me. On SWE-bench Verified, mini-SWE-agent + GPT-5.2 Codex scores 72.8. I don’t see a comparable GPT-5.3 Codex number there, so I’m using 5.2 as the baseline. On OpenAI’s GPT-5.4 page (SWE-Bench Pro, Public), the score improves from 55.6 (GPT-5.2) to 57.7 (GPT-5.4), a gain of 2.1 points. It’s a different benchmark, so this is only a rough signal, but I’d expect a similar setup on SWE-bench Verified to improve by a few points rather than make a huge jump. I’m curious how GPT-5.4 in Codex changes real-world results.
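For anyone who wants to check the arithmetic, here is a rough sketch in Python. The scores are the ones quoted above. The variable names are mine, and the last step (adding the SWE-Bench Pro gain straight onto the SWE-bench Verified baseline) is purely an illustrative assumption, not a prediction, since the two benchmarks are not directly comparable.

```python
# Scores quoted in the post (percent of tasks resolved).
swe_bench_pro = {"GPT-5.2": 55.6, "GPT-5.4": 57.7}  # SWE-Bench Pro (Public)
verified_baseline = 72.8  # mini-SWE-agent + GPT-5.2 Codex, SWE-bench Verified

# Absolute gain on SWE-Bench Pro, in percentage points.
delta_points = round(swe_bench_pro["GPT-5.4"] - swe_bench_pro["GPT-5.2"], 1)

# The same gain relative to the GPT-5.2 score, in percent.
relative_gain_pct = round(100 * delta_points / swe_bench_pro["GPT-5.2"], 1)

# ASSUMPTION (illustrative only): the Pro gain carries over one-to-one
# to SWE-bench Verified. Real results could land well above or below this.
naive_verified_estimate = round(verified_baseline + delta_points, 1)

print(f"SWE-Bench Pro gain: +{delta_points} points ({relative_gain_pct}% relative)")
print(f"Naive SWE-bench Verified estimate: {naive_verified_estimate}")
```

If the gain did carry over like that, the result would sit in the same range as the other scores listed below, which is why I expect a few points rather than a big jump.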

Recent SWE-bench Verified scores I’m watching:

Claude 4.5 Opus (high reasoning): 76.8

Gemini 3 Flash (high reasoning): 75.8

MiniMax M2.5 (high reasoning): 75.8

Claude Opus 4.6: 75.6

GPT-5.2 Codex: 72.8

Source: https://www.swebench.com/index.html

By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.

1 comment

kaufmann 110 days ago

I would recommend https://swe-rebench.com for comparison. It is always based on new problems.
