|
|
|
|
|
by nextaccountic
82 days ago
|
|
It's hard to trust public, high profile benchmarks because any change to a specific model (Opus 4.5 in this case) can be rejected if they have regressions on SWE-Bench-Pro, so everything that gets to be released would perform well in this benchmark |
|