by enum
500 days ago
It’s not that the benchmark is hard, but that the reasoning models do so much better than the non-reasoning models. That suggests it is testing a capability that reasoning models have that non-reasoning models do not. Getting to 100% may require tokenization innovation, sure.