by mbh159 129 days ago
Opus 4.6 just dropped, so we’re tossing it straight into the arena.

CivBench measures agents the hard way: long-horizon strategy in a Civilization-style simulator. The benchmark is full of hidden information, shifting incentives, and an adversary that’s actively trying to ruin your plan. Hundreds of turns where small mistakes compound.

In 15 minutes we're running an exhibition match: Claude Opus 4.6 vs GPT-5.2, live.

One note on the setup: we’re running GPT-5.2 right now, and we’ll switch to 5.3-Codex the moment it’s available via API.

After the game, we'll have full receipts: replay, logs, and transparent Elo ratings. No “trust us” charts. If you want to see how these models actually behave under pressure (not just how they test), come watch live.
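For anyone unfamiliar with Elo, here is a minimal sketch of the standard two-player update. The post doesn't say how CivBench computes its ratings, so the K-factor of 32 and the 1500 starting ratings are assumptions for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k is an assumed K-factor, not a value taken from CivBench.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


# Two equally rated agents; the first one wins.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

With logs and replays published, anyone could rerun an update like this over the match results and check the leaderboard independently.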

Feedback welcome, especially from people working on agent evals or RL.

1 comments

What sort of context do you give the APIs when you are starting the game? Does it need to learn the rules as it goes?
We have a standard harness for each of the models that we test. Each prompt includes the rules, access to memory, and a lookup of the complete ruleset. The prompt adapts each turn, adding the legal actions for that turn and guidance that depends on the stage of the game (updated based on the player's technological progress).

Unlike RL algorithms, these LLMs wouldn't learn quickly enough without the prior knowledge the harness provides.
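A rough sketch of what a per-turn prompt builder along those lines could look like. Everything here is hypothetical: the `TurnState` fields, the tech-count thresholds, and the guidance text are invented for illustration and are not the actual CivBench harness.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical rules summary; a real harness would also expose a full ruleset lookup.
RULES_SUMMARY = "You control one civilization. Each turn, choose exactly one legal action."

# Hypothetical (minimum techs researched, guidance) pairs, sorted by threshold.
STAGE_GUIDANCE = [
    (0, "Early game: explore and secure nearby resources."),
    (5, "Mid game: balance expansion against defense."),
    (12, "Late game: commit to a victory condition."),
]


@dataclass
class TurnState:
    turn: int
    techs_researched: int
    legal_actions: List[str]
    memory_notes: List[str] = field(default_factory=list)


def stage_guidance(techs_researched: int) -> str:
    """Pick the guidance for the highest threshold the player has reached."""
    chosen = STAGE_GUIDANCE[0][1]
    for threshold, text in STAGE_GUIDANCE:
        if techs_researched >= threshold:
            chosen = text
    return chosen


def build_prompt(state: TurnState) -> str:
    """Assemble rules, stage guidance, memory notes, and this turn's legal actions."""
    actions = "\n".join(f"- {a}" for a in state.legal_actions)
    notes = "\n".join(f"- {n}" for n in state.memory_notes) or "- (none)"
    return (
        f"RULES\n{RULES_SUMMARY}\n\n"
        f"GUIDANCE\n{stage_guidance(state.techs_researched)}\n\n"
        f"MEMORY\n{notes}\n\n"
        f"TURN {state.turn} LEGAL ACTIONS\n{actions}"
    )


state = TurnState(turn=42, techs_researched=6,
                  legal_actions=["move_unit north", "research writing"])
print(build_prompt(state))
```

Listing only the legal actions each turn keeps the model from proposing invalid moves, and keying guidance to tech progress rather than turn number matches the description above.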

What do you use for memory?
Tool calls over Redis for now. It would be cool to experiment with different context/memory management systems for the agents, though!
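As a hedged sketch of what memory via tool calls over a key-value store might look like: the tool-call shape (`op`, `key`, `value`), the key prefix, and the `DictStore` stand-in are all assumptions, not the harness's real schema. The in-memory store keeps the example runnable without a server; a `redis.Redis` client exposes `get` and `set` methods with the same basic shape and could be passed in instead.

```python
import json
from typing import Any, Dict, Optional


class DictStore:
    """In-memory stand-in with the same get/set shape as a Redis client."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> bool:
        self._data[key] = value
        return True


def handle_memory_tool_call(store: Any, agent_id: str, call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one hypothetical memory tool call and return a JSON-friendly result."""
    op = call.get("op")
    # Namespace keys per agent so two players never read each other's notes.
    key = f"memory:{agent_id}:{call.get('key', '')}"
    if op == "write":
        store.set(key, json.dumps(call["value"]))
        return {"ok": True}
    if op == "read":
        raw = store.get(key)
        # json.loads accepts str or bytes, so this also works with bytes replies.
        return {"ok": True, "value": None if raw is None else json.loads(raw)}
    return {"ok": False, "error": f"unknown op {op!r}"}


store = DictStore()
handle_memory_tool_call(store, "opus", {"op": "write", "key": "plan", "value": {"target": "iron"}})
print(handle_memory_tool_call(store, "opus", {"op": "read", "key": "plan"}))
```

Keeping the store behind a tiny get/set interface would also make it easy to swap in other memory backends for the kind of experiments mentioned above.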