An attempt at a summary of the argument:

- Human brains are estimated to have a few hundred trillion synapses. If you tried to replicate this in a neural network with one parameter per synapse, it would be much larger than the largest models in use today.
- Conventional wisdom, in the form of the Chinchilla scaling law, suggests that to train such a gargantuan model you would need an even more gargantuan training corpus.
- But no human has read anywhere near as much text as even relatively small Chinchilla-optimal models are trained on. In fact, rather than acquiring as much data as possible as efficiently as possible, children might prefer to rewatch the exact same video for the umpteenth time. When they learn arithmetic, it's from just a paltry few examples provided by the teacher in school.
- Large neural networks trained on so little data would quickly memorize it perfectly and overfit horribly.
- Individuals with photographic memory demonstrate that human brains do have the memorization capacity you would expect from their synapse count, and they appear to have difficulty generalizing as a side effect.
- Speculatively, typical humans forget and generalize instead of memorizing because synaptic strengths are reduced during sleep, analogous to regularization by weight decay.
- Therefore, maybe we should train extremely large models on little data with extremely strong weight decay to counteract memorization, and hope a large learning rate will quickly "catapult" them to a generalizing solution.

What I'm missing is a discussion of how much this would cost, even if you handle deployment by distillation into smaller, faster, less data-efficient models.
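To put rough numbers on the scaling argument and the cost question, here is a back-of-envelope sketch. It assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter, and the usual approximation of about 6 FLOPs per parameter per training token. The synapse count (2e14) and the "little data" budget (1e9 tokens) are placeholder values picked for illustration, not figures from the argument itself.

```python
# Back-of-envelope sketch. Every constant is an assumption, not a measurement.
SYNAPSES = 2e14              # "a few hundred trillion" synapses, one parameter each
TOKENS_PER_PARAM = 20        # approximate Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6    # common training-compute approximation, C ~ 6 * N * D
SMALL_DATA_TOKENS = 1e9      # hypothetical human-scale data budget


def chinchilla_tokens(n_params):
    """Compute-optimal token count under the ~20 tokens/parameter heuristic."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Approximate training compute for one pass over n_tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


optimal_tokens = chinchilla_tokens(SYNAPSES)
optimal_flops = training_flops(SYNAPSES, optimal_tokens)
small_data_flops = training_flops(SYNAPSES, SMALL_DATA_TOKENS)

print(f"Chinchilla-optimal tokens:     {optimal_tokens:.1e}")
print(f"Chinchilla-optimal FLOPs:      {optimal_flops:.1e}")
print(f"One pass over the small corpus: {small_data_flops:.1e} FLOPs")
print(f"Data shortfall factor:         {optimal_tokens / SMALL_DATA_TOKENS:.1e}")
```

Under these assumptions the compute-optimal corpus would be on the order of 4e15 tokens, millions of times larger than the small corpus. Note that the small-data figure covers a single pass only; repeating the data for many epochs multiplies the cost accordingly.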
Note that LLM parameters don't map to synapses in the same naive way they would for a fully connected network. Each attention parameter is applied thousands or millions of times to the inputs in each inference pass, so it's more like each parameter codes for a neural circuit that is repeated thousands of times.
I think of attention as a sort of convolution: in an NN, each convolution kernel gets applied repeatedly to all parts of an image, but in the human visual cortex I imagine these circuits are effectively all separate and parallel. The few parameters of a convolution kernel map to thousands of identical circuits in the visual cortex.
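A minimal sketch of the weight-sharing point, counting how many times a shared set of weights is reused and how many weights you would need if every position had its own copy. The layer shapes below are hypothetical examples, and the "unshared" figure is just one way of reading the analogy, not a claim about actual cortical wiring.

```python
def conv_layer_stats(height, width, kernel, in_ch, out_ch):
    """Stride-1, same-padded conv layer: (shared params, positions, unshared params)."""
    params = kernel * kernel * in_ch * out_ch  # bias terms omitted
    positions = height * width                 # the kernel is applied once per output pixel
    unshared = params * positions              # a separate copy of the weights per position
    return params, positions, unshared


def projection_stats(d_model, seq_len):
    """One d_model x d_model attention projection applied at every token position."""
    params = d_model * d_model
    positions = seq_len
    return params, positions, params * positions


# Hypothetical shapes, chosen only for illustration.
for name, stats in [
    ("3x3 conv, 64->64 channels, 224x224 image", conv_layer_stats(224, 224, 3, 64, 64)),
    ("attention projection, d_model=4096, 2048 tokens", projection_stats(4096, 2048)),
]:
    params, positions, unshared = stats
    print(f"{name}: {params:,} shared params, reused at {positions:,} positions, "
          f"{unshared:,} params if unshared")
```

In both cases the reuse factor is simply the number of positions, so a model's parameter count can understate the size of the equivalent "unrolled" circuit by several orders of magnitude.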