by rthnbgrredf 500 days ago
I'm still not convinced that this isn't a tokenizer issue.

Were you able to find a substantial number of questions that do not fall into the letter-counting or word-shuffling domain - problems that are clearly unrelated to the fundamental tokenizer issue of modern LLMs? Otherwise, I would argue that your paper simply proves that the issue still exists.

1 comment

It’s not that the benchmark is hard, but that the reasoning models do so much better than the non-reasoning models. That suggests it is testing a capability that reasoning models have that non-reasoning models do not.

Getting to 100% may require tokenization innovation, sure.