by yorwba 11 days ago
An attempt at a summary of the argument:

- Human brains are estimated to have a few hundred trillion synapses. If you tried to replicate this in a neural network model with one parameter per synapse, it would be much larger than the largest models in use today.

- Conventional wisdom in the form of the Chinchilla scaling law suggests that to train such a gargantuan model, you would need an even more gargantuan training corpus.

- But no human has read anywhere near as much as even relatively small Chinchilla-optimal models. In fact, rather than acquiring as much data as possible as efficiently as possible, children may prefer to rewatch the exact same video for the umpteenth time. When they learn arithmetic, it's from just a paltry few examples provided by the teacher in school.

- Large neural networks trained on so little data would quickly memorize it perfectly and overfit horribly.

- Individuals with photographic memory demonstrate that human brains indeed have the memorization capacity you would expect based on synapse count, and appear to show difficulties with generalization as a side-effect.

- Speculatively, typical humans forget and generalize instead of memorizing because synaptic strengths are reduced during sleep in an analogue to regularization by weight decay.

- Therefore, maybe we should train extremely large models on little data with extremely strong weight decay to counteract memorization, and hope a large learning rate will quickly "catapult" them to a generalizing solution.
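
To make the weight-decay bullets concrete, here is a minimal sketch of what the regularizer does mechanically. It is a toy, not the article's training recipe: the two-weight setup, the loss, and the hyperparameter values are all assumptions chosen for illustration.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One plain SGD step with weight decay.

    Each weight is first shrunk toward zero by a factor (1 - lr * weight_decay),
    then moved against its gradient.
    """
    return [w * (1.0 - lr * weight_decay) - lr * g
            for w, g in zip(weights, grads)]


def train(steps=500, lr=0.1, weight_decay=0.5):
    # Toy loss (assumption): 0.5 * (w[0] - 1)^2.
    # Only w[0] is supported by the "data"; w[1] never receives a gradient,
    # like a memorized detail that is never reinforced.
    weights = [0.0, 3.0]
    for _ in range(steps):
        grads = [weights[0] - 1.0, 0.0]
        weights = sgd_step(weights, grads, lr, weight_decay)
    return weights


supported, unsupported = train()
print(supported, unsupported)
```

In this toy, the weight the loss keeps pulling on settles near 1 / (1 + weight_decay), while the weight with no gradient support decays toward zero. That is the "forget what isn't reinforced" behavior the sleep analogy points at.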

What I'm missing is a discussion of how much this would cost, even if you handle deployment by distillation into smaller, faster, less data-efficient models.
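
A rough back-of-the-envelope for the cost question, as a sketch only. It assumes the common rules of thumb of about 20 training tokens per parameter for Chinchilla-optimal training and about 6 FLOPs per parameter per training token, and it takes 2e14 as a stand-in for "a few hundred trillion" synapses. None of these numbers come from the article.

```python
def chinchilla_tokens(params, tokens_per_param=20.0):
    """Chinchilla-style rule of thumb: tokens scale linearly with parameters."""
    return params * tokens_per_param


def training_flops(params, tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


SYNAPSE_PARAMS = 2e14  # assumption: one parameter per synapse, 200 trillion synapses

optimal_tokens = chinchilla_tokens(SYNAPSE_PARAMS)
optimal_flops = training_flops(SYNAPSE_PARAMS, optimal_tokens)
flops_per_token = training_flops(SYNAPSE_PARAMS, 1.0)

print(f"Chinchilla-optimal tokens: {optimal_tokens:.1e}")
print(f"Chinchilla-optimal training FLOPs: {optimal_flops:.1e}")
print(f"Training FLOPs per token seen: {flops_per_token:.1e}")
```

Under these assumptions, even the "little data" regime pays on the order of 1e15 FLOPs for every training token, so the total bill depends mostly on how many passes over the small corpus the strong weight decay ends up requiring.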


> Human brains are estimated to have a few hundred trillion synapses. If you tried to replicate this in a neural network model with one parameter per synapse...

Note that LLM parameters don't map to synapses in the same naive way they would for a fully connected network. Each attention parameter is applied thousands or millions of times to the inputs at each inference pass, so it's more like each param might code for a neural circuit repeated thousands of times.

I think of attention as a sort of convolution: in a NN, each convolution kernel gets applied repeatedly to all parts of an image, but in the human visual cortex I imagine these circuits are effectively all separate and parallel. The few parameters of a convolution kernel map to thousands of identical circuits in the visual cortex.
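
A small sketch of the weight-sharing point, using a 1D convolution instead of attention to keep it short. The signal and the two-tap edge-detector kernel are made up for illustration.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (stride 1, no padding).

    This is cross-correlation, which is what neural-network "convolution"
    layers usually compute.
    """
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def uses_per_parameter(signal_length, kernel_size):
    """How many times each kernel weight is applied in one pass."""
    return signal_length - kernel_size + 1


signal = [0, 0, 1, 1, 0, 0]
edge_kernel = [-1, 1]  # two parameters in total

edges = conv1d_valid(signal, edge_kernel)
uses = uses_per_parameter(len(signal), len(edge_kernel))
print(edges, uses)
```

Two parameters get applied at five positions here. A cortex-style implementation without weight sharing would need five separate copies of the circuit, which is the sense in which one parameter can stand in for many synapses.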

A biological synapse's weight takes effect whenever its input changes. So although it cannot be copied and applied in parallel to different inputs at the same time (and hence your visual cortex has a bunch of more-or-less identical edge-detection circuits) it can still be applied sequentially to different inputs at different times. And when LLMs do operate in sequential mode, generating tokens one at a time, they typically access each parameter at most once per forward pass.

Though there are things like looped transformers that reuse the same parameters multiple times even for a single token, so maybe those will finally give us AGI if scaled up to a trillion parameters and looped hundreds of times. (Sounds expensive!)
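
To put a number on "sounds expensive", a quick sketch. It assumes the usual approximation of about 2 FLOPs per parameter per token for a forward pass, and it takes 100 loops as the low end of "hundreds"; both are assumptions, not figures from any specific looped-transformer paper.

```python
def forward_flops_per_token(params, loops=1):
    """Approximate forward-pass cost per token.

    Assumes about 2 FLOPs (one multiply, one add) per parameter per use,
    with every parameter reused `loops` times per token.
    """
    return 2.0 * params * loops


PARAMS = 1e12  # "a trillion parameters"
LOOPS = 100    # assumption: low end of "hundreds of times"

single_pass = forward_flops_per_token(PARAMS)
looped = forward_flops_per_token(PARAMS, LOOPS)
print(f"single pass: {single_pass:.1e} FLOPs/token, looped: {looped:.1e} FLOPs/token")
```

Under these assumptions the looped model costs about as much per token as an unlooped model with a hundred times as many parameters, while storing only the original trillion.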

> A biological synapse's weight takes effect whenever its input changes.

I don't think it makes sense to try to compare our brains to ANNs; they are apples and oranges.

A synapse's weight is dynamically modulated by the astrocyte on multiple time scales (millisecond, sub-second, minutes), and the astrocyte itself is receiving inputs and performing computation (in addition to impacting the neural network).

> I don't think it makes sense to try to compare our brains to ANNs; they are apples and oranges.

It makes perfect sense to compare them. There are clear similarities in the style of processing. And I rarely, if ever, see people over-interpreting comparisons.

--

The above insight, that convolution in a model has a not-the-same but still related relationship to living neurons, isn't nonsense. In both cases, parameters are not just being used once in a given short-term response, even though the specifics of reuse are different.

And the relationship can be stronger: There is a lot of evidence convolution does effectively happen in the brain, via similar operations occurring across a region of similarly organized neurons, instead of via "reused" neurons/parameters. I.e. lots of regularity in the visual system's early processing.

Other things I find interesting: Human neurons are very noisy and statistical, but some of that gets smoothed by soma integration. So there is a loose correspondence with the sigmoid function, with biology encoding by frequency instead of amplitude.
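
A toy sketch of the "frequency instead of amplitude" correspondence. It assumes a deliberately crude model in which a neuron spikes independently in each time step with probability sigmoid(drive); real neurons are far messier, and the step count and seed are arbitrary.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spike_train(drive, n_steps, rng):
    """Noisy binary spikes: each step fires with probability sigmoid(drive)."""
    p = sigmoid(drive)
    return [1 if rng.random() < p else 0 for _ in range(n_steps)]


def firing_rate(spikes):
    """Averaging over time plays the role of smoothing out the noise."""
    return sum(spikes) / len(spikes)


rng = random.Random(0)  # fixed seed so the run is repeatable
rates = {drive: firing_rate(spike_train(drive, 10_000, rng))
         for drive in (-2.0, 0.0, 2.0)}

for drive, rate in rates.items():
    print(f"drive={drive:+.1f}  rate={rate:.3f}  sigmoid={sigmoid(drive):.3f}")
```

Each individual spike is all-or-nothing and noisy, but the averaged rate tracks the sigmoid's amplitude. That is the loose sense in which a rate-coded neuron and a sigmoid unit carry the same signal.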

Also, the branched dendritic trees of live neurons are not passive; they can have apparently active aggregation points. Which makes a human neuron more comparable to a neuron with multiple feeder neurons. I.e. a very small two-layer net. And it adds the possibility of tunable "parameters" within the dendritic tree, in addition to synapse strengths.
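
A minimal sketch of the neuron-as-two-layer-net picture. It assumes each dendritic branch applies its own saturating nonlinearity (tanh here, chosen arbitrarily) before the soma sums the branches; the weights and inputs are invented for illustration.

```python
import math


def dendritic_neuron(inputs_per_branch, synapse_weights, branch_weights):
    """Two-layer toy neuron.

    Layer 1: each branch sums its own synapses and applies tanh.
    Layer 2: the soma sums the weighted branch outputs and applies a sigmoid.
    """
    branch_outputs = [
        math.tanh(sum(w * x for w, x in zip(ws, xs)))
        for xs, ws in zip(inputs_per_branch, synapse_weights)
    ]
    soma = sum(bw * bo for bw, bo in zip(branch_weights, branch_outputs))
    return 1.0 / (1.0 + math.exp(-soma))


synapse_weights = [[1.0, 1.0], [1.0, 1.0]]  # two branches, two synapses each
branch_weights = [1.0, 1.0]                 # the extra "tunable parameters"

# Same total input, placed on one branch vs. spread over two branches.
clustered = dendritic_neuron([[2.0, 2.0], [0.0, 0.0]], synapse_weights, branch_weights)
spread = dendritic_neuron([[2.0, 0.0], [2.0, 0.0]], synapse_weights, branch_weights)
print(clustered, spread)
```

With identical synapse strengths and identical total input, the output still depends on which branch the input lands on. A single-layer point neuron cannot express that, and the branch weights are a second place where learning could act.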

The contrast of gradient algorithms, vs. whatever algorithms human cells learn with, is really interesting. We know a little about how one neuron learns, but not much at all about how organized neurons learn together. In this case, comparison is fruitful for the contrast it highlights.

Modeling the biological neuron as a little two-layer net suggests that perhaps learning operates at multiple levels in a single neuron. I.e. "two-layer" learning rules.

> But no human has read anywhere near as much as even relatively small Chinchilla-optimal models

They're missing that humans don't consume raw text. They consume non-stop high resolution, high FPS audio and video imagery. If you tokenized the input to human eyes and ears in the first few years of life, that's more data than even the largest LLMs are trained on.
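
For a sense of scale, a sketch of the raw-stream arithmetic. Every number below is a camera-like stand-in chosen for illustration (1080p, 30 FPS, 3 bytes per pixel, 12 waking hours a day), not a measurement of human vision, and it counts uncompressed bytes rather than information content.

```python
def raw_video_bytes(width, height, bytes_per_pixel, fps, hours_per_day, days):
    """Uncompressed size of a video stream watched for the given time."""
    bytes_per_frame = width * height * bytes_per_pixel
    seconds = hours_per_day * 3600 * days
    return bytes_per_frame * fps * seconds


# Hypothetical stand-in values, see the note above.
one_year = raw_video_bytes(width=1920, height=1080, bytes_per_pixel=3,
                           fps=30, hours_per_day=12, days=365)

print(f"raw bytes per year: {one_year:.2e}")
print(f"1% of that: {one_year / 100:.2e}")
```

With these stand-in values the raw stream lands in the petabyte range per year. Whether raw bytes are the right thing to count, as opposed to the part that is not predictable from previous frames, is exactly what the thread is arguing about.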

I didn't include it in my summary (it took me an hour to read the whole thing, obviously a lot had to be cut) but the article does actually address the "high resolution" argument in a three-paragraph bullet point under the "Sample Inefficiency" subheading: https://gwern.net/llm-catapult#sample-inefficiency If you read it on a 4K screen at 120 FPS, you should be able to take in its information content in less than a microsecond.

They "address" it by making the false statement that the video stream is highly predictable. Sure, you might be able to predict 99% of the video stream (for which you'd need to have a physics model, negating the whole point of fast learning in babies), but the remaining 1% still amounts to terabytes, if not petabytes, per year.

I think this is addressed in the blog post:

  And on the human side, disabled people are not much less intelligent than normal humans: deaf/blind people are much worse at language tasks, but their fluid intelligence often remains normal. If the sensory bandwidth were so critical, this would be impossible.