HN Mirror



	by ai-tamer 100 days ago
	Same. The numbers match your feel. Going from 4.6 to 4.7: +14.6 on MCP-Atlas, +10.9 on SWE-bench Pro, tool errors cut by two-thirds. But BrowseComp dropped 4.7 points. Anthropic's own announcement says 4.7 "takes the instructions literally" where 4.6 interpreted them loosely, and recommends re-tuning prompts accordingly. In a conversational loop with an opinionated developer, that translates to lower quality because there is less reasoning: the model executes instead of thinking things through. https://llm-stats.com/blog/research/claude-opus-4-7-vs-opus-... https://www.anthropic.com/news/claude-opus-4-7


siva7 99 days ago

So it became GPT 5.4 xhigh, but at ten times the cost?


ai-tamer 99 days ago

We're rich ;-)

More seriously, in a multi-agent setup the per-token cost matters less: a bit of Claude, a bit of Codex, a bit of Gemini-CLI, ... No single model carries the full bill, and having three different training sets catches more "green tests, wrong code" than any single xhigh pass would. Even at 10x per token, one well-placed Opus in the reviewer seat beats running a full Opus session for everything.
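
A back-of-the-envelope sketch of the cost side of that argument, in Python. Everything here is an assumption for illustration: the function name `session_cost`, the 1,000,000-token session, and the 10% reviewer share are made up, and only the "10x per token" ratio comes from the thread. It says nothing about the quality claim, only about the bill.

```python
# Hypothetical relative prices; only the 10x ratio is taken from the thread.
CHEAP_PRICE = 1.0   # relative cost per token for the cheaper models (assumed)
OPUS_PRICE = 10.0   # "10x per token"


def session_cost(total_tokens: int, opus_share: float) -> float:
    """Relative cost of a session when `opus_share` of the tokens go through Opus."""
    if not 0.0 <= opus_share <= 1.0:
        raise ValueError("opus_share must be between 0 and 1")
    opus_tokens = total_tokens * opus_share
    cheap_tokens = total_tokens - opus_tokens
    return opus_tokens * OPUS_PRICE + cheap_tokens * CHEAP_PRICE


# Assumed workload: 1,000,000 tokens in total.
full_opus = session_cost(1_000_000, 1.0)       # Opus does everything
reviewer_only = session_cost(1_000_000, 0.1)   # Opus only reviews (assumed 10% of tokens)

print(f"full Opus session:  {full_opus:,.0f}")
print(f"Opus as reviewer:   {reviewer_only:,.0f}")
print(f"ratio:              {full_opus / reviewer_only:.1f}x")
```

Under these made-up numbers the reviewer-only setup costs roughly a fifth of the full session. The real ratio depends entirely on how many tokens the review step actually consumes.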
