Hacker News
by ai-tamer 53 days ago
Same. The numbers match your feel. Going from 4.6 to 4.7: +14.6 on MCP-Atlas, +10.9 on SWE-bench Pro, tool errors cut by two-thirds. But BrowseComp dropped 4.7 points. Anthropic's own announcement says 4.7 "takes the instructions literally" where 4.6 interpreted them loosely, and recommends re-tuning prompts accordingly. In a conversational loop with an opinionated developer, that translates to lower quality because there is less reasoning: the model executes instead of thinking things through. https://llm-stats.com/blog/research/claude-opus-4-7-vs-opus-... https://www.anthropic.com/news/claude-opus-4-7
1 comment

So it became gpt 5.4 xhigh but ten times the cost?
We're rich ;-)

More seriously, in a multi-agent setup the per-token cost matters less: a bit of Claude, a bit of Codex, a bit of Gemini-CLI, ... No single model carries the full bill, and having three models with different training sets catches more "green tests, wrong code" cases than any single xhigh pass would. Even at 10x per token, one well-placed Opus in the reviewer seat beats a full Opus session on everything.
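
A rough sketch of the cost side of that argument. Only the 10x price ratio comes from the thread; the role names, the token split, and the `blended_cost` helper are made up for illustration, so treat the result as back-of-the-envelope arithmetic and not a measurement. It says nothing about the quality side.

```python
# Relative per-token prices. Only the 10x ratio comes from the thread;
# the absolute unit is arbitrary.
CHEAP_PRICE = 1.0
OPUS_PRICE = 10.0 * CHEAP_PRICE


def blended_cost(tokens_by_role, price_by_role):
    """Sum tokens * per-token price over every role in the pipeline."""
    return sum(tokens_by_role[role] * price_by_role[role] for role in tokens_by_role)


# Hypothetical token split for one task (invented numbers).
tokens = {"implementer": 80_000, "tester": 15_000, "reviewer": 5_000}

# Opus only in the reviewer seat, cheaper models everywhere else.
reviewer_only = blended_cost(
    tokens,
    {"implementer": CHEAP_PRICE, "tester": CHEAP_PRICE, "reviewer": OPUS_PRICE},
)

# Opus in every seat.
opus_everywhere = blended_cost(tokens, {role: OPUS_PRICE for role in tokens})

# How many times more expensive the all-Opus run is under these assumptions.
ratio = opus_everywhere / reviewer_only
print(f"reviewer-only: {reviewer_only:.0f}, all-Opus: {opus_everywhere:.0f}, ratio: {ratio:.1f}x")
```

With this particular split the all-Opus run costs a bit under 7x the reviewer-only run; the gap shrinks as the reviewer's share of tokens grows.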