Hacker News
by enum 500 days ago
The point isn’t that the benchmark is hard, but that reasoning models do so much better on it than non-reasoning models. That suggests it is testing a capability that reasoning models have and non-reasoning models lack.

Getting to 100% may require tokenization innovation, sure.