Hacker News
by ethanpil 22 days ago
The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%

Then, when you scroll all the way down to the Footnotes section at the bottom, it says:

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

1 comment

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.
Why not state that?
Maybe the delta is worse under their respective native harnesses.
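
For anyone who wants the arithmetic spelled out, here is a rough sketch using only the three numbers quoted above. The dictionary layout and the `gap` helper are my own illustration, not anything from the announcement, and Claude's score under the Claude Code harness is deliberately left as `None` because nobody in the thread has that figure.

```python
# Scores quoted in the post, in percent. The Opus 4.8 score under the
# Claude Code harness is not stated anywhere in the thread, so it is
# left as None rather than guessed.
scores = {
    ("Opus 4.8", "Terminus-2"): 74.6,
    ("GPT-5.5", "Terminus-2"): 78.2,
    ("GPT-5.5", "Codex CLI"): 83.4,
    ("Opus 4.8", "Claude Code"): None,
}


def gap(a, b):
    """Return a - b in percentage points, or None if either score is unknown."""
    if a is None or b is None:
        return None
    return round(a - b, 1)


# Gap on the shared public harness (the headline table).
same_harness_gap = gap(scores[("GPT-5.5", "Terminus-2")],
                       scores[("Opus 4.8", "Terminus-2")])

# How much GPT-5.5 gains from its native harness (the footnote).
gpt_harness_uplift = gap(scores[("GPT-5.5", "Codex CLI")],
                         scores[("GPT-5.5", "Terminus-2")])

# Native-vs-native gap: unknown until a Claude Code score is published.
native_gap = gap(scores[("GPT-5.5", "Codex CLI")],
                 scores[("Opus 4.8", "Claude Code")])

print("Same-harness gap (points):", same_harness_gap)
print("GPT-5.5 native-harness uplift (points):", gpt_harness_uplift)
print("Native-vs-native gap (points):", native_gap)
```

The native-vs-native gap stays unknown, which is really the point of the question above: without a Claude Code number, the footnote only tells you one side of the harness effect.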