by Workaccount2
557 days ago
>With the discovery that transformers lack reasoning capabilities

The only paper I have seen claiming this studied only lightweight open-source models (under 27B, mostly 2B and 8B). They also included o1 and 4o for reference, which kind of broke their hypothesis, but they just left that part out of the conclusion. Not even kidding: their graphs show o1 and 4o performing strongly on their benchmarks, but the conclusion focuses only on 2B and 7B models like gemma and qwen.
|
An 18% drop in accuracy (figure 8) is not insignificant. Even 4o suffered a 10% loss (figure 6), and 4o isn't a small LLM.
A competent model should show near-zero performance loss. The simplest benchmark merely changes things like "John had 4 apples" to "Mary had 4 oranges." Performance loss from changing inconsequential tokens is the very definition of overfitting.
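To make the "inconsequential tokens" point concrete, here is a rough Python sketch of that kind of perturbation test. Everything in it is hypothetical and only for illustration: the template, the name and item lists, and both toy solvers are my own stand-ins, not the paper's benchmark or any real model. The idea is just that swapping names and objects leaves the correct answer unchanged, so any accuracy drop across variants comes from surface tokens alone.

```python
import random

# Hypothetical template: only the name and the item vary, the arithmetic does not.
TEMPLATE = (
    "{name} had {n} {item}. {name} gave away {k} {item}. "
    "How many {item} does {name} have left?"
)
NAMES = ["John", "Mary", "Priya", "Omar"]   # illustrative values
ITEMS = ["apples", "oranges", "pencils"]    # illustrative values


def make_variant(rng, n=4, k=1):
    """Build (question, answer) with swapped surface tokens and fixed numbers."""
    name = rng.choice(NAMES)
    item = rng.choice(ITEMS)
    return TEMPLATE.format(name=name, n=n, item=item, k=k), n - k


def accuracy(solver, variants):
    """Fraction of (question, answer) pairs the solver gets right."""
    return sum(1 for q, a in variants if solver(q) == a) / len(variants)


def memorizing_solver(question):
    """Toy stand-in for an overfit model: only 'knows' the John/apples wording."""
    return 3 if "John" in question and "apples" in question else None


def reasoning_solver(question):
    """Toy stand-in for a robust model: reads the two numbers and subtracts."""
    nums = [int(tok) for tok in question.replace(".", " ").split() if tok.isdigit()]
    return nums[0] - nums[1]


rng = random.Random(0)
variants = [make_variant(rng) for _ in range(20)]
original = [(TEMPLATE.format(name="John", n=4, item="apples", k=1), 3)]

print("memorizing, original:", accuracy(memorizing_solver, original))
print("memorizing, variants:", accuracy(memorizing_solver, variants))
print("reasoning,  variants:", accuracy(reasoning_solver, variants))
```

The gap between a solver's score on the original wording and its score on the variants is the quantity being argued about here; a solver that actually does the subtraction scores the same on both.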