by ethanpil
22 days ago
The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1)
Opus 4.8: 74.6%
GPT 5.5: 78.2%

Then, when you scroll all the way down to the Footnotes section at the bottom, it says: "Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."