|
|
|
|
|
by syntex
371 days ago
|
|
I didn't downvote. T the problem with the paper is that it asks the model to output all moves for, say, 15 disks 2 ^ 15 - 1 = 32767 32767 moves in a single prompt. That's not testing reasoning. That’s testing whether the model can emit a huge structured output without error, under a context window limit. The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000. For me o3 returning Python code isn’t a failure. It’s a smart shortcut. The failure is in the benchmark design. This benchmark just smells. |
|
Agreed. But to be fair, 1) a relatively simple algorithm can do it, and more importantly 2) a lot of people are trying to build products around doing exactly this (emit large structured output without error).