| I'm a co-creator of SWE-bench: 1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth. 2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated. 3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :) |
They're saying:
1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.
2. Frontier models have already read and memorized the PRs that the problems are based on.
3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name. That name isn't mentioned in the problem, but frontier models are passing that test anyway because they remember that such a helper function is necessary.
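
To make point 3 concrete, here's a toy sketch of the failure mode. Everything in it is made up for illustration (the `slugify` task, the `_collapse_spaces` helper, the `hidden_test` grader); it is not taken from any actual SWE-bench problem. Both solutions behave identically, but the hidden test also reaches for a helper name that the task statement never mentions:

```python
from types import SimpleNamespace

# Hypothetical task statement: "slugify(title) should collapse runs of
# whitespace into a single dash." No helper function is mentioned.


def make_reference_solution():
    """Stands in for the original PR: correct behavior plus a private helper."""
    def _collapse_spaces(text):
        return " ".join(text.split())

    def slugify(title):
        return _collapse_spaces(title).lower().replace(" ", "-")

    return SimpleNamespace(slugify=slugify, _collapse_spaces=_collapse_spaces)


def make_independent_solution():
    """Equally correct behavior, written without knowledge of the PR."""
    def slugify(title):
        return "-".join(title.lower().split())

    return SimpleNamespace(slugify=slugify)


def hidden_test(module):
    """Hypothetical grader: checks behavior AND an unmentioned helper name."""
    try:
        assert module.slugify("Hello   World") == "hello-world"
        # This line couples the test to the reference implementation.
        assert module._collapse_spaces("a   b") == "a b"
    except (AssertionError, AttributeError):
        return False
    return True


reference_passes = hidden_test(make_reference_solution())
independent_passes = hidden_test(make_independent_solution())
```

Here the independent solution gets marked wrong even though its `slugify` output is correct, purely because it doesn't expose `_collapse_spaces`. A model that remembers the original PR would likely reproduce the helper name and pass.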
If the next stage of benchmarks doesn't address these issues, it will continue to have the same problems, saturated or not.