by jdw64
39 days ago
This is great, but personally I really wish we had an Elo leaderboard specifically for the quality of coding agents. Honestly, I think GPT-5.5 Codex doesn't just crush Claude Code 4.7 opus; it writes code at such an advanced level that I sometimes struggle to fully comprehend it. Even when navigating fairly massive codebases that span four different languages and regions (US, China, Korea, and Japan), Codex's performance is simply overwhelming. How would we even go about properly measuring and benchmarking Elo for autonomous agents like this?
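The rating math itself is the easy part. Here is a minimal sketch, assuming blind pairwise judgments of two agents' solutions to the same task (1.0 = win, 0.5 = tie, 0.0 = loss). The agent names, the K-factor of 32, and the 1000 starting rating are arbitrary placeholders and do not reflect any existing leaderboard's setup:

```python
# Minimal Elo sketch for pairwise coding-agent comparisons.
# Assumption: each match is a blind judgment of which agent's solution
# to the same task was better. Names, K=32, and 1000 are placeholders.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Update both ratings in place after one comparison between a and b."""
    ea = expected_score(ratings[a], ratings[b])
    delta = k * (score_a - ea)
    ratings[a] += delta
    ratings[b] -= delta  # zero-sum: B loses exactly what A gains


def rate(matches, initial: float = 1000.0, k: float = 32.0) -> dict:
    """Replay (agent_a, agent_b, score_a) matches in order and return ratings."""
    ratings: dict = {}
    for a, b, score_a in matches:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update(ratings, a, b, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical judgments; "agent_x" and "agent_y" are placeholder names.
    matches = [
        ("agent_x", "agent_y", 1.0),
        ("agent_x", "agent_y", 0.5),
        ("agent_y", "agent_x", 0.0),
    ]
    for name, r in sorted(rate(matches).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {r:.1f}")
```

Sequential Elo updates depend on the order of the matches, so for a fixed set of comparisons, fitting a Bradley-Terry model over all of them at once is a common alternative.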