Hacker News
by zamalek 559 days ago
https://arxiv.org/abs/2410.05229

An 18% drop in accuracy (figure 8) is not insignificant. Even 4o suffered a 10% loss (figure 6), and 4o isn't a small LLM.

A competent model should show near-zero performance loss. The simplest benchmark merely changes things like "john had 4 apples" to "Mary had 4 oranges." Performance loss due to inconsequential tokens changing is the very definition of over-fitting.
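To make the "inconsequential tokens" point concrete, here is a minimal sketch of that kind of perturbation test. Everything in it is an assumption for illustration: the template, the name and fruit lists, and both toy solvers are hypothetical stand-ins, not the paper's benchmark code or any real model.

```python
import random
import re

# Hypothetical template in the spirit of "john had 4 apples" -> "Mary had 4 oranges".
TEMPLATE = (
    "{name} had {n} {fruit}. {name} bought {m} more {fruit}. "
    "How many {fruit} does {name} have now?"
)
NAMES = ["John", "Mary", "Priya", "Omar"]
FRUITS = ["apples", "oranges", "pears"]

# The one wording the toy "over-fitted" solver has memorized.
SEEN_QUESTION = TEMPLATE.format(name="John", n=4, m=3, fruit="apples")


def make_variant(rng, n=4, m=3):
    """Return (question, answer) with only the surface tokens changed."""
    question = TEMPLATE.format(
        name=rng.choice(NAMES), n=n, m=m, fruit=rng.choice(FRUITS)
    )
    return question, n + m


def accuracy(solver, items):
    """Fraction of (question, answer) pairs the solver gets right."""
    return sum(solver(q) == a for q, a in items) / len(items)


def memorizing_solver(question):
    """Toy over-fitted solver: correct only on the exact wording it has seen."""
    return 7 if question == SEEN_QUESTION else None


def arithmetic_solver(question):
    """Toy robust solver: ignores names and objects, adds the numbers."""
    return sum(int(x) for x in re.findall(r"\d+", question))


rng = random.Random(0)
original = [(SEEN_QUESTION, 7)]
variants = [make_variant(rng) for _ in range(20)]

for solver in (memorizing_solver, arithmetic_solver):
    drop = accuracy(solver, original) - accuracy(solver, variants)
    print(f"{solver.__name__}: accuracy drop = {drop:.0%}")
```

The solver that actually does the arithmetic shows no drop by construction, while the memorizing one loses accuracy on every variant whose wording differs from what it saw. That gap is the over-fitting signature described above.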

4 comments

I just don't see how anyone can see a study comparing the reasoning abilities of various LLMs, see that large LLMs have better reasoning abilities, and conclude that LLMs can't reason. LLMs don't have human-like reasoning abilities, but it's just obviously true that they have some capacity for reasoning; that ability seems to scale roughly linearly with model size and training FLOPs.

Yes, but is human reasoning on the same spectrum as LLM reasoning? Meaning, will scale alone turn the latter into the former?

No definitive answer yet, but my bet is on no.

Agreed, and I think the answer is pretty clear.

The large models that are successful now have dodged recurrent architectures, which are harder to train but allow open-ended inference steps, and so would scale straightforwardly to any number of reasoning steps.

At some point, recurrent connections are going to get re-incorporated into these models.

Maybe two-stage training. First stage: learn to integrate as much information as well as possible, without recurrence, as is happening now. Second stage: embed that model in a larger iterative model, and train for variable-step reasoning.

Finally, successful iterative reasoning responses can be used as further examples for the non-iterative module.

This would be similar to how we reason in steps at first, in unfamiliar areas, but quickly learn to reason with faster direct responses as we gain familiarity.

We continually fine-tune our fast mode on our own more powerful slow mode's successes.

Lol, imagine being downvoted for asking a couple of questions.

Still 5k points to go, though! :D

It's clear, though, that as the models get bigger and more advanced, their "reasoning" benchmark results improve. The conclusion, though, just focuses on the bottom-tier models. The fact that they set out to create an LLM benchmark and then focus only on bottom-tier models is itself ridiculous.

The authors did the equivalent of "Let's design a human intelligence benchmark, and use a bunch of 12-year-olds as reference points."

I will eat my hat if the authors rescind the paper in a year or so, should their benchmarks show no difference on SOTA models.

>The simplest benchmark merely changes things like "john had 4 apples" to "Mary had 4 oranges."

Those models (4o, o1-mini, preview) don't see any drop at all on those benchmarks. The only benchmark that sees drops with the SOTA models is the one they add, "seemingly relevant but ultimately irrelevant information".

Humans can and do drop in performance when presented with such alterations. Are they better than LLMs in that case? Who knows? These papers don't bother testing human baselines.

Has anyone done this sort of test on people?