HN Mirror

	by enum 500 days ago
	It’s not that the benchmark is hard, but that reasoning models do so much better on it than non-reasoning models. That suggests it is testing a capability that reasoning models have and non-reasoning models lack. Getting to 100% may require tokenization innovation, sure.