by ethanpil 22 days ago

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1): Opus 4.8 74.6%, GPT 5.5 78.2%

Then, when you scroll all the way down to the Footnotes section at the bottom, it says:

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
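To make the gaps concrete, here is a minimal sketch of the arithmetic using only the three figures quoted above. The dictionary layout and the `gap` helper are illustrative assumptions, and no native-harness score for Opus 4.8 is assumed because none is quoted.

```python
# Scores quoted in the post, in percent.
# No Claude Code harness score for Opus 4.8 appears in the quoted text,
# so none is included here.
scores = {
    ("Opus 4.8", "Terminus-2"): 74.6,
    ("GPT 5.5", "Terminus-2"): 78.2,
    ("GPT 5.5", "Codex CLI"): 83.4,
}


def gap(a, b):
    """Return scores[a] - scores[b] in percentage points, rounded to 0.1."""
    return round(scores[a] - scores[b], 1)


# Both models on the same public harness (the comparison in the table).
same_harness_gap = gap(("GPT 5.5", "Terminus-2"), ("Opus 4.8", "Terminus-2"))

# How much GPT 5.5 gains from its native harness (the footnote).
harness_uplift = gap(("GPT 5.5", "Codex CLI"), ("GPT 5.5", "Terminus-2"))

# Mixed comparison: GPT 5.5 on its native harness vs Opus 4.8 on Terminus-2.
mixed_gap = gap(("GPT 5.5", "Codex CLI"), ("Opus 4.8", "Terminus-2"))

print(same_harness_gap, harness_uplift, mixed_gap)  # 3.6 5.2 8.8
```

The mixed comparison is not like-for-like. A native-versus-native gap would need an Opus 4.8 score under its own harness, which the quoted text does not give.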

1 comment

fastball 21 days ago

Seems reasonable? Presumably Claude also performs better under the Claude Code harness.

Why not state that?

Maybe the delta is worse when each model runs under its own native harness.
