Hacker News
by xpuente 557 days ago
The issue is that no one fully understands why synaptic pruning occurs in biology. Large language models have no direct connection to biological systems, and pruning in LLMs is no exception.
3 comments

A number of things that work for biological systems (humans) work for LLMs too:

- After the answer, ask it "are you sure?" (from The Office TV series: "is it a stupid thing to do? if it is, don't do it").
- Chain of thought, step-by-step thinking.
- Different hats (Godfather style: peacetime vs. wartime consigliere): looking at the problem from different points of view (at the same time or in stages). For example, first draft: stream-of-consciousness answer; second iteration: critic/editor/reviewer (produces comments); third: address comments; repeat for some time.
- Collaborative work of different experts (MoE), delegating specific tasks to specialists.
- [Deliberate] practice with immediate feedback.
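
To make the "different hats" idea concrete, here is a rough sketch of the draft / critic / revise loop. The `generate` callable is a stand-in for whatever model call you use (it is hypothetical, not a real API), and the prompt wording is only an assumption:

```python
from typing import Callable


def draft_critique_revise(
    question: str,
    generate: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    rounds: int = 2,
) -> str:
    """Write a first draft, then alternate critic and reviser passes."""
    # First hat: quick stream-of-consciousness draft.
    answer = generate(f"Answer quickly, thinking out loud:\n{question}")
    for _ in range(rounds):
        # Second hat: critic/editor/reviewer that only produces comments.
        comments = generate(
            "Act as a critical reviewer and list problems with this answer.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        # Third hat: address the comments and produce a new answer.
        answer = generate(
            "Revise the answer so that it addresses the comments.\n"
            f"Question: {question}\nAnswer: {answer}\nComments: {comments}"
        )
    return answer
```

With `rounds=2` this makes five model calls: one draft, then two critic/revise pairs. Whether a fixed number of rounds is the right stopping rule is an open choice; you could also stop when the critic has nothing left to say.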

In ANNs, pruning helps prevent over-fitting. With the discovery that transformers lack reasoning capabilities, this research really comes at a great time. It's a minuscule chance, but we might see this improve performance over the long term and with further research.
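
For anyone unfamiliar with pruning in ANNs, one common variant is magnitude pruning: zero out the weights with the smallest absolute values. This is only a minimal sketch of that one variant, on a flat list of weights; it is an assumption on my part that this is the kind of pruning meant here, and `magnitude_prune` is a made-up helper name:

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


print(magnitude_prune([0.5, -0.1, 0.03, -2.0], sparsity=0.5))
# [0.5, 0.0, 0.0, -2.0]
```

Real frameworks typically prune per layer and fine-tune afterwards; none of that is shown here.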
>With the discovery that transformers lack reasoning capabilities

The only paper I have seen claiming this studied only lightweight open-source models (<27B, mostly 2B and 8B). They also included o1 and 4o for reference, which kind of broke their hypothesis, but they just left that part out of the conclusion. Not even kidding, their graphs show o1 and 4o having strong performance in their benchmarks, but the conclusion just focuses on 2B and 7B models like gemma and qwen.

https://arxiv.org/abs/2410.05229

An 18% drop in accuracy (figure 8) is not insignificant. Even 4o suffered a 10% loss (figure 6), and 4o isn't a small LLM.

Competent performance should have near zero performance loss. The simplest benchmark merely changes things like "john had 4 apples" to "Mary had 4 oranges." Performance loss due to inconsequential tokens changing is the very definition of over-fitting.
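
A toy sketch of what that kind of test measures, to make the over-fitting point concrete. Everything here is made up for illustration (the template, the names, both "solvers"); it is not the paper's benchmark or code:

```python
import random
import re

# Hypothetical word-problem template; only the name and item change between variants.
TEMPLATE = "{name} had {n} {item} and gave away {k}. How many {item} are left?"
ORIGINAL = (TEMPLATE.format(name="John", n=4, item="apples", k=1), 3)


def make_variants(n: int, k: int, count: int, seed: int = 0):
    """Return (question, answer) pairs that differ only in inconsequential tokens."""
    rng = random.Random(seed)
    names = ["Mary", "Ahmed", "Li"]
    items = ["oranges", "pears", "plums"]
    return [
        (TEMPLATE.format(name=rng.choice(names), n=n, item=rng.choice(items), k=k), n - k)
        for _ in range(count)
    ]


def accuracy(solver, problems) -> float:
    """Fraction of problems where the solver returns the expected answer."""
    if not problems:
        raise ValueError("need at least one problem")
    return sum(solver(q) == a for q, a in problems) / len(problems)


def memorizing_solver(question: str):
    """Toy over-fit solver: only knows the exact original wording."""
    return {ORIGINAL[0]: ORIGINAL[1]}.get(question)


def parsing_solver(question: str) -> int:
    """Toy solver that uses the numbers and ignores the names and items."""
    n, k = (int(x) for x in re.findall(r"\d+", question))
    return n - k


variants = make_variants(n=4, k=1, count=20)
for solver in (memorizing_solver, parsing_solver):
    drop = accuracy(solver, [ORIGINAL]) - accuracy(solver, variants)
    print(solver.__name__, "accuracy drop:", drop)
```

The memorizing solver loses all of its accuracy on the renamed variants, while the parsing one loses none. Real models presumably sit somewhere in between, which is what the accuracy-drop figures are trying to quantify.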

I just don't see how anyone can see a study comparing the reasoning abilities of various LLMs, see that large LLMs have better reasoning abilities and conclude that LLMs can't reason. LLMs don't have human-like reasoning abilities, but it's just obviously true that they have some capacity for reasoning; that ability seems to scale roughly linearly with model size and training FLOPs.
Yes, but is human-reasoning on the same spectrum as LLM-reasoning? Meaning that only scale will turn the latter into the former?

No definitive answer yet, but my bet is on no.

Agreed, and I think the answer is pretty clear.

The large models that are successful now have dodged recurrent architectures, which are harder to train but allow for open-ended inference steps, which in turn would allow straightforward scaling to any number of reasoning steps.

At some point, recurrent connections are going to get re-incorporated into these models.

Maybe two stage training. First stage, learn to integrate as much information as well as possible, without recurrence. As is happening now. Second training stage, embed that model in a larger iterative model, and train for variable step reasoning.

Finally, successful iterative reasoning responses can be used as further examples for the non-iterative module.

This would be similar to how we reason in steps at first, in unfamiliar areas. But quickly learn to reason with faster direct responses, as we gain familiarity.

We continually fine-tune our fast mode on our own more powerful slow mode successes.

Lol, imagine being downvoted for asking a couple of questions.

Still 5k points to go, though! :D

It's clear though that as the models get bigger and more advanced, their "reasoning" benchmark results improve. The conclusion though just focuses on the bottom tier models. The fact they even set out to create an LLM benchmark and only focus on bottom tier models itself is ridiculous.

The authors did the equivalent of "Let's design a human intelligence benchmark, and use a bunch of 12 year olds as reference points"

I will eat my hat if the authors rescind the paper in a year or so if their benchmarks show no difference on SOTA models.

>The simplest benchmark merely changes things like "john had 4 apples" to "Mary had 4 oranges."

Those models (4o, o1-mini, preview) don't see any drop at all on those benchmarks. The only benchmark that sees drops with the SOTA models is the one they add, "seemingly relevant but ultimately irrelevant information".

Humans can and do drop in performance when presented with such alterations. Are they better than LLMs in that case? Who knows? Because these papers don't bother testing human baselines.

Has anyone done this sort of test on people?
A vocal minority of researchers are essentially human chauvinists --- they "want to believe" that LLMs can't "really" perform this or that part of cognition even though the evidence is blinding that they can. (Anyone who genuinely believes that LLMs can't reason at all has never used an LLM.) These researchers start with their conclusion and work backwards to an argument, making their work seductive but useless.
The problem is in being able to discern reasoning from patterns that happen to exist in the training data. There are plenty of tricks you can play on an LLM by subverting the expectations it must necessarily have due to its training data. A human might fall into the same trap, but can then reason themselves out of it, whereas an LLM tends to double down on its mistake.
So you are saying that LLMs can do reasoning? Logical reasoning is something completely different from likelihood-based word completion. A pure LLM will never be able to do reasoning; you need a hybrid. Use the LLM for classification and completion and a logic system for reasoning.
Really? It seems obvious to me.

During the learning stage we want input from every variable so that we are sure that we don't omit a variable that turns out to be essential for the calculation. However, in any calculation a human does, 99.9999% of variables are irrelevant (e.g. what day of the week it is, am I sleepy, etc.), so of course the brain wouldn't use resources to keep connections that aren't relevant to a given function. Imagine what a liability it would be if we had excessive direct connections from our visual processing system to the part of our brain that controls heart rate.

We can convince ourselves of a lot of things that 'seem obvious'. The pesky thing is that sometimes those obvious facts have the temerity to be untrue. That's why we try to understand systems instead of believing obvious things.
As far as I know, pruning is related to age. At birth, we have a massive number of silent synapses. As we grow older, those that remain unused (i.e., inactive) tend to disappear. This process involves a delicate mechanism, including components of the immune system.

The unfortunate reality is that no one truly understands how memory works. Many theories are floating around, but the fundamental components remain elusive. One thing is certain: it is quite different from backpropagation. Thankfully, our brains do not suffer from catastrophic forgetting.