| I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance. 71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running. But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark. Take for example Refact which is at #2 with 74.4%, they built a 2k lines of code framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent which tries again, so it’s effectively multiple attempts per problem. If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now. |
https://en.wikipedia.org/wiki/Goodhart%27s_law