by ankit219
376 days ago
The paper is sloppy. The original point may have some credence [1], but the way they went about showing it is borderline irresponsible.

The first problem is that they conflated the number of steps involved with difficulty level, without even considering the size of the solution space. Second, the solutions are long, models are trained to keep answers concise, and they are measuring consistency across tries. (E.g., Tower of Hanoi with 13 disks needs about 80k tokens just to blurt out the answer. The model already knows there is literally one way to solve it, ergo the search space is not that big, but the paper takes the failure to show that it is not reasoning. Of course it fails, since Sonnet with 64k tokens would run out of tokens even without reasoning.)

Then you have the scenario where even a 0.999-accurate LLM messes up one token and goes wrong on a run. They cited that as an example of how LLMs get it wrong, and concluded it's memorization and pattern matching, not reasoning. Real-world data and usage do not correspond to that.

[1]: Anthropic found that reasoning is not 100% accurate. That's the premise of the paper; the headline is just super clickbaity.
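To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. It assumes the 13 refers to disks (an optimal solution then takes 2^13 - 1 moves), about 10 tokens per written-out move, and independent per-token errors. The tokens-per-move figure and the independence assumption are mine, not numbers from the paper.

```python
# Back-of-the-envelope check; the constants below are illustrative assumptions.
TOKENS_PER_MOVE = 10       # assumed cost of writing out one move
TOKEN_BUDGET = 64_000      # the 64k output limit mentioned in the comment
PER_TOKEN_ACCURACY = 0.999 # the accuracy figure used in the comment


def hanoi_moves(disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `disks` disks."""
    return 2 ** disks - 1


def flawless_run_probability(accuracy: float, tokens: int) -> float:
    """Chance that every token is right, assuming independent errors."""
    return accuracy ** tokens


disks = 13
moves = hanoi_moves(disks)
tokens = moves * TOKENS_PER_MOVE
fits_budget = tokens <= TOKEN_BUDGET
p_flawless = flawless_run_probability(PER_TOKEN_ACCURACY, tokens)

print(f"moves={moves}, tokens={tokens}, fits_budget={fits_budget}")
print(f"probability of a flawless run: {p_flawless:.2e}")
```

Under these assumptions the full answer does not fit in the output budget at all, and even with unlimited tokens a single flawless run would be vanishingly unlikely, which says nothing either way about whether the model is reasoning.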