by tl2do
111 days ago
In my day-to-day coding work, the top 3 coding agents are already good enough for me.
On SWE-bench Verified, mini-SWE-agent + GPT-5.2 Codex is 72.8. I don’t see a comparable GPT-5.3 Codex number there, so I’m using 5.2 as the baseline.
On OpenAI’s GPT-5.4 page (SWE-Bench Pro, Public), the score improves from 55.6 (GPT-5.2) to 57.7 (GPT-5.4), a gain of 2.1 points.
It’s a different benchmark, so this is only a rough signal, but I’d expect a similar setup on SWE-bench Verified to improve by a few points, not by a huge jump.
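The back-of-the-envelope arithmetic behind that guess can be sketched in a few lines of Python. The scores are the ones quoted above. The projection assumes the SWE-Bench Pro delta carries over unchanged to SWE-bench Verified, which is purely an illustrative assumption and not a measured result.

```python
# Scores quoted in this comment (percent of tasks resolved).
pro_gpt_5_2 = 55.6        # SWE-Bench Pro (Public), GPT-5.2
pro_gpt_5_4 = 57.7        # SWE-Bench Pro (Public), GPT-5.4
verified_baseline = 72.8  # SWE-bench Verified, mini-SWE-agent + GPT-5.2 Codex

# Absolute gain in points and relative gain in percent on SWE-Bench Pro.
delta = round(pro_gpt_5_4 - pro_gpt_5_2, 1)
relative_gain = round(delta / pro_gpt_5_2 * 100, 1)

# ASSUMPTION: the same absolute delta transfers to SWE-bench Verified.
# This is a rough guess for illustration, not a published score.
projected = round(verified_baseline + delta, 1)

print(f"SWE-Bench Pro gain: +{delta} points ({relative_gain}% relative)")
print(f"Naive SWE-bench Verified projection: {projected}")
```

Under that assumption the projection lands in the mid-70s, i.e. a few points above the baseline rather than a large jump.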
I’m interested in how GPT-5.4 in Codex changes real-world results.
Recent SWE-bench Verified scores I’m watching:
Claude 4.5 Opus (high reasoning): 76.8
Gemini 3 Flash (high reasoning): 75.8
MiniMax M2.5 (high reasoning): 75.8
Claude Opus 4.6: 75.6
GPT-5.2 Codex: 72.8
Source: https://www.swebench.com/index.html
By the way, in my experience the agent part of Codex CLI has improved a lot and has become comparable to Claude Code. That is good news for OpenAI.