HN Mirror

Yes, it seems the open benchmark results that are normally reported, such as SWE-bench, SWE-bench Verified, and Terminal-bench, aren't very indicative of success in more general use cases.

According to Gemini, SWE-bench is actually a very narrow test: it consists of fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that). Terminal-bench (basically agentic computer tool use) is aimed more at the general case than at the tools used by a typical coding agent such as Claude Code, Codex CLI, or Gemini CLI.