HN Mirror

by CamperBob2 470 days ago

The author (and Chomsky) fail to understand that LLMs (as well as human brains) are not just autoregressive models, but nonlinear autoregressive models. Put a slightly different way, you can describe LLMs as autoregressive, but only by taking liberties with the classical definition of 'autoregressive.'
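To make the linear vs. nonlinear distinction concrete, here is a toy sketch. It is not taken from the post or from any real model: the function names and the `tanh` nonlinearity are assumptions chosen for illustration. A classical AR(p) model predicts the next value as a fixed weighted sum of the last p values; a nonlinear autoregressive model feeds the same history through a nonlinear function instead.

```python
import math

def linear_ar_step(history, coeffs):
    """Classical AR(p): weighted sum of the last p values (most recent first)."""
    p = len(coeffs)
    recent = history[-p:]
    return sum(c * x for c, x in zip(coeffs, reversed(recent)))

def nonlinear_ar_step(history, coeffs):
    """Nonlinear AR: same inputs, squashed by tanh (an arbitrary illustrative choice)."""
    return math.tanh(linear_ar_step(history, coeffs))

def rollout(step, seed, coeffs, n):
    """Generate n new values, feeding each prediction back in as input."""
    history = list(seed)
    for _ in range(n):
        history.append(step(history, coeffs))
    return history

# Both models are "autoregressive" in the sense that outputs are fed back in;
# only the function applied to the history differs.
linear_series = rollout(linear_ar_step, [1.0, 2.0], [0.5, 0.25], 3)
nonlinear_series = rollout(nonlinear_ar_step, [1.0, 2.0], [0.5, 0.25], 3)
```

A transformer is vastly more complex than `tanh` of a weighted sum, but the feedback loop in `rollout` is the part the word "autoregressive" refers to in both cases.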

> The human mind is not, like ChatGPT and its ilk, a lumbering statistical engine for pattern matching, gorging on hundreds of terabytes of data and extrapolating the most likely conversational response or most probable answer to a scientific question. On the contrary, the human mind is a surprisingly efficient and even elegant system that operates with small amounts of information; it seeks not to infer brute correlations among data points but to create explanations. – Noam Chomsky

It's as if Chomsky has either never heard of transformers, or doesn't understand what they do.

> Before speaking a sentence, we have a general idea of what we’re going to say; we don’t really choose what to say next based on the last word. That kind of planning isn’t something that can be represented sequentially.

It's as if the author (and Chomsky) has never seen a CoT model in action.

1 comment

Wonderfall 470 days ago

Author here and I welcome the feedback, but I don't really understand your point. My post is clearly not dismissive of efforts to make LLMs reason using CoT prompting techniques and post-training, and I think such efforts are even mentioned. The model remains autoregressive either way, and this reasoning is not some kind of magic that makes them behave differently - these improvements only make them perform (much) better on given tasks.

Additionally, I'm not dismissive of the non-linear nature of transformers, which I'm familiar with. The attention mechanism is a lot more complex than a linear relationship between the prediction and the past inputs, yes. But the end result remains sequential prediction. Ironically, diffusion models are kind of the opposite: sequential internally, parallel prediction at each step.
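The contrast between the two decoding styles can be sketched with toy stand-ins. Everything here is hypothetical: `toy_next_token` and `toy_refine` are placeholder functions, not real model calls, and the loops are only meant to show where the sequential dependency sits in each case.

```python
def autoregressive_decode(next_token, prompt, n_new):
    """One token per step; each step conditions on all tokens produced so far."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def iterative_parallel_decode(refine, length, n_steps, blank=0):
    """Diffusion-style loop: steps run in sequence, but each step rewrites every position at once."""
    tokens = [blank] * length
    for _ in range(n_steps):
        tokens = refine(tokens)
    return tokens

def toy_next_token(tokens):
    # Placeholder for a model's next-token prediction.
    return (sum(tokens) + 1) % 10

def toy_refine(tokens):
    # Placeholder for a denoising step applied to all positions in parallel.
    return [(t + i + 1) % 10 for i, t in enumerate(tokens)]

ar_output = autoregressive_decode(toy_next_token, [1, 2], 3)
parallel_output = iterative_parallel_decode(toy_refine, 3, 2)
```

In the first loop the sequence grows left to right; in the second the sequence length is fixed from the start and only the number of refinement passes is sequential.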

(Note: I added a note on terminology, since the confusion arose from my use of "linearity", which was not referring to the attention mechanism itself. I've read so many papers that are perfectly fine with the use of "autoregressive" for this paradigm that I forgot some people coming from traditional statistics may be confused. Also, "based on the last word" was wrong; I meant "last words" or "previous words", obviously.)

All that being said, I don't think it's fair to say someone doesn't understand how transformers work solely because of a difference in how a term is interpreted. I appreciate the feedback though!

nikhilsimha 467 days ago

Not saying that our current approaches will lead to intelligence. No one can know.

It could very well be that the internal mechanism of our thought has an auto-regressive reasoning component.

The full system would effectively "combine" short-term memory (what just happened) with "pruned" long-term memory (the relevant things I know from the past) and push the result into a raw autoregressive reasoning component.

It is also possible that another specialized auto-regressive reasoning component is driving the "prune" and "combine" operations. This whole system could be represented entirely within the larger network.
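A rough sketch of that arrangement, purely as an illustration: the word-overlap scoring in `prune`, the list concatenation in `combine`, and the `toy_core` stand-in are all assumptions made up for this example, not a claim about how brains or any real system work.

```python
def prune(long_term, query, k):
    """Keep the k long-term memories sharing the most words with the query."""
    query_words = set(query.split())
    ranked = sorted(
        long_term,
        key=lambda memory: len(query_words & set(memory.split())),
        reverse=True,
    )
    return ranked[:k]

def combine(short_term, pruned):
    """Place the relevant long-term memories ahead of what just happened."""
    return pruned + short_term

def reasoning_step(core, short_term, long_term, k=2):
    """Build a context from both memories and hand it to the autoregressive core."""
    context = combine(short_term, prune(long_term, " ".join(short_term), k))
    return core(context)

def toy_core(context):
    # Placeholder for the raw autoregressive reasoner; here it just joins its input.
    return " | ".join(context)

memories = ["cats purr", "dogs bark", "cats nap"]
result = reasoning_step(toy_core, ["cats meow"], memories, k=2)
```

Here `prune` and `combine` are hand-written, but as the comment suggests, nothing prevents those operations from being driven by another learned component.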

The argument that "intelligence cannot be auto-regressive" seems to be without basis to me.

> there is strong evidence that not all thinking is linguistic or sequential.

It is possible that a system wrapping a core auto-regressive reasoner can produce non-sequential thinking - even if you don't allow for weight updates.

Wonderfall 466 days ago

I completely agree. I never said that "intelligence cannot be auto-regressive"; I just questioned whether it can be achieved this way. And I don't actually have answers. I just wrote down some thoughts to spark some interesting discussion about that, and I'm glad it worked (a little) in the end.

I also mentioned that I'm supportive of architectures that will integrate autoregressive components. Totally agree with that.
