by itay-maman
133 days ago
The inherent problem with evaluating models' coding performance remains: most day-to-day coding tasks are open-ended or only partially spec'd, so there is huge uncertainty about what the "right" solution looks like. It's very hard to rank models' solutions to such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected). Even Opus 4.5 coding a C compiler from scratch - jaw-dropping as it is - doesn't tell the whole story. Most of my tasks are not that well spec'd.

According to Gemini, SWE-bench is actually a very narrow test: it consists of fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that). Terminal-bench (basically agentic computer tool use) focuses more on the general case than on the tools used by a typical coding agent such as Claude Code, Codex CLI or Gemini CLI.