| HN Mirror


by gizmodo59 99 days ago

Unfortunately, the paper doesn't include GPT-5.3, which was released around the same time as Opus 4.6, or GPT-5.4, released a few days ago. Both are available via the API.

https://developers.openai.com/api/docs/models/gpt-5.3-codex

IMHO, the vendor's harness should be used when running these experiments. The model vendors know best how to build the harness for their own models (Codex with GPT-5.4, Claude Code with Opus 4.6), and that makes a big difference for any kind of agentic coding task.

I see Claude and GPT as neck and neck in coding. Every other model + harness combination is definitely 3-6 months behind. Right now Codex seems to be the best at solving complex bugs and long-running tasks, with much higher limits and even better speed, while Claude seems to do well on front-end work and its CLI UX is nice! The Codex app is very good too (I wish it weren't Electron, since it's a memory hog, but it's good).


jasonjmcghee 99 days ago

> model vendors know best on giving the best harness

For a while, this was only true of Claude Code. Codex was poor and Gemini was unusable.

Since then Codex has gotten quite good.


jsemrau 98 days ago

It still regularly fubars my code, at 11x the price. GitHub Copilot Agentic Mode + Sonnet 4.6 is stable and inexpensive.


p1esk 99 days ago

Are you saying they did not use native harnesses like Claude Code or Codex? How did they do it then?
