by XCSme
14 days ago
On my tests[0] it does a bit worse, and it's almost 2x as expensive as Opus 4.7... I was surprised to see that it failed a Data extraction test (it gets it right 2 out of 3 times, but one time it randomly returns null for a value instead). It makes some sense that it fails more Trivia/Domain-specific knowledge tasks (I think models are increasingly trained for agentic use cases rather than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Double-checking my test harness, but it's the first model that does this, so I doubt the issue is on my side...
EDIT: Harness seems correct; on straight coding tasks they perform identically: https://i.snipboard.io/5xbpzY.jpg
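
For anyone wanting to check this kind of flakiness themselves, here is a minimal sketch of how a repeated-run data extraction test could be scored. It is not the actual aibenchy harness: `score_extraction`, `run_model`, and the sample fields are hypothetical names made up for illustration, and the "model" is a stub that returns null once in three runs, mirroring the failure described above.

```python
from typing import Callable


def score_extraction(run_model: Callable[[], dict], expected: dict, runs: int = 3) -> dict:
    """Run the same extraction several times; count exact matches and record null fields."""
    passes = 0
    null_fields = []
    for _ in range(runs):
        result = run_model()
        # Fields where a value was expected but the model returned null/None.
        nulls = [k for k, v in expected.items() if v is not None and result.get(k) is None]
        null_fields.append(nulls)
        if result == expected:
            passes += 1
    return {
        "passes": passes,
        "runs": runs,
        "pass_rate": passes / runs,
        "null_fields": null_fields,
    }


def make_flaky_model() -> Callable[[], dict]:
    """Hypothetical stand-in for a model call: the second run drops a value to None."""
    outputs = iter([
        {"name": "Ada", "year": 1843},
        {"name": "Ada", "year": None},
        {"name": "Ada", "year": 1843},
    ])
    return lambda: next(outputs)


report = score_extraction(make_flaky_model(), {"name": "Ada", "year": 1843})
print(report["passes"], report["null_fields"])  # 2 [[], ['year'], []]
```

Logging which field came back null on each run, not just pass/fail, makes it easier to tell an intermittent model failure apart from a parsing bug in the harness.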