| Opus 4.6 just dropped, so we’re tossing it straight into the arena. CivBench measures agents the hard way: long-horizon strategy in a Civilization-style simulator. The benchmark is full of hidden information, shifting incentives, and an adversary that’s actively trying to ruin your plan, played over hundreds of turns where small mistakes compound. In 15 minutes we’re running an exhibition match: Claude Opus 4.6 vs GPT-5.2, live. One note on the setup: we’re running GPT-5.2 right now, and we’ll switch to 5.3-Codex the moment it’s available via API. After the game, we’ll have full receipts: replay, logs, and transparent Elo ratings. No “trust us” charts. If you want to see how these models actually behave under pressure (not just how they test), come watch live. Feedback welcome, especially from people working on agent evals or RL. |