| > We already know LLMs are more than just words,
Just because you say something doesn’t mean it’s true. They are literally next token prediction machines, normally trained on just text tokens. All they know is words. It happens that we humans encode and assign a lot of meaning in words and their semantics. LLMs can replicate combinations of words that appear to have this intent and understanding, even though they literally can’t, as these are just statistically likely next tokens. (Not that knowing likely next tokens isn’t useful, but it’s far from understanding.) Any assignment of meaning, reasoning, or whatever that we humans make is personification bias. Machines designed to spit out convincing text successfully spit out convincing text, and now swaths of people think that more is going on. I’m not as well versed in multimodal models, but the ideas should be consistent. They are guessing statistically likely next tokens, regardless of whether those tokens represent text or audio or images or whatever.
Not useless at all, but not this big existential advancement some people seem to think it is. The whole AGI hype is very similar to “theory of everything” hype that comes and goes now and again. |
And in order to predict the next token well, they have to build world models; otherwise they would just output nonsense. This has been shown empirically [1].
This notion that just calling them "next token predictors" somehow precludes them from being intelligent rests on the premise that human intelligence cannot be reduced to next token prediction, but nobody has proven any such thing! In fact, our best models of human cognition are literally based on predictive coding.
LLMs are probably not the final story in AGI, but claiming they are not reasoning or not understanding is at best speculation, because we lack a mechanistic understanding of what "understanding" and "reasoning" actually mean. In other words, you don't know that you are not just a fancy next token predictor.
[1] https://arxiv.org/abs/2310.02207