by CamperBob2
470 days ago
The author (and Chomsky) fail to understand that LLMs (as well as human brains) are not just autoregressive models, but nonlinear autoregressive models. Put a slightly different way, you can describe LLMs as autoregressive, but only by taking liberties with the classical definition of 'autoregressive.'

> The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. – Noam Chomsky

It's as if Chomsky has either never heard of transformers, or doesn't understand what they do.

> Before speaking a sentence, we have a general idea of what we’re going to say; we don’t really choose what to say next based on the last word. That kind of planning isn’t something that can be represented sequentially.

It's as if the author (and Chomsky) has never seen a CoT model in action.
Additionally, I'm not dismissive of the non-linear nature of transformers, which I'm familiar with. The attention mechanism is a lot more complex than a linear relationship between the prediction and the past inputs, yes. But the end result remains sequential prediction. Ironically, diffusion models are kind of the opposite: sequential internally, with parallel prediction at each step.
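To make the terminology point concrete, here is a minimal sketch of the distinction. It is a toy, not a transformer: the coefficients, the seed values, and the use of `tanh` as the nonlinearity are all assumptions chosen for illustration. The classical AR step is a weighted sum of past values, the nonlinear step passes the same inputs through a nonlinear function, and both are driven by the same sequential loop that feeds each prediction back in.

```python
import math

def linear_ar_step(history, coeffs):
    """Classical AR(p): the next value is a weighted sum of the last p values."""
    p = len(coeffs)
    recent = history[-p:]
    # coeffs[0] applies to the most recent value, coeffs[1] to the one before, etc.
    return sum(c * x for c, x in zip(coeffs, reversed(recent)))

def nonlinear_ar_step(history, coeffs):
    """Toy nonlinear AR: same past inputs, passed through a nonlinearity."""
    return math.tanh(linear_ar_step(history, coeffs))

def generate(step, seed, coeffs, n_steps):
    """Sequential prediction: each new value is appended and fed back in."""
    history = list(seed)
    for _ in range(n_steps):
        history.append(step(history, coeffs))
    return history

coeffs = [0.5, 0.3]  # hypothetical weights, most recent value first
seed = [1.0, 2.0]    # hypothetical starting sequence

linear_seq = generate(linear_ar_step, seed, coeffs, 3)
nonlinear_seq = generate(nonlinear_ar_step, seed, coeffs, 3)
```

The `generate` loop is identical in both cases, which is the sense of "autoregressive" used in the post. Only the step function changes: doubling the seed doubles the linear output, but not the nonlinear one.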
(Note: I added a note on terminology, since the confusion arose from my use of "linearity", which was not referring to the attention mechanism itself. I've read so many papers that are perfectly fine with the use of "autoregressive" for this paradigm that I forgot some people coming from traditional statistics may be confused. Also, "based on the last word" was wrong; I meant "last words" or "previous words", obviously.)
All that being said, I don't think it's fair to say someone doesn't understand how transformers work solely because of a difference in semantic interpretation. I appreciate the feedback though!