by abeppu
1107 days ago
I disagree. The criticism is _not_ that basic building blocks cannot be combined to produce something richer. The issue is the "without any reference to meaning" part of the quoted definition from Bender in that article. Models which are _only_ trained on text do not have a grounding to relate linguistic forms to anything else. When you know what an apple is, it's in part because you've seen and touched and tasted and eaten one. The model only knows how people talk about apples, and which texts are plausible, but not which ones are true.

But we're already getting past this with multi-modal models! Some really great work is being done which ties language processing with visual perception and in some cases robot action planning. A model can know how we talk about apples, can see where an apple is in a scene, can navigate to and retrieve an apple, etc. This lets us get at truth ("Is the claim 'the apple is on the book' true of this scene?") in a way which text-only models fundamentally cannot have.

The point is, the way you get past the "stochastic parrot" phase requires qualitative structural changes to incorporate different kinds of information -- not just scaling up text-only models.

> They can't prove I'm not a stochastic parrot anymore than they can prove whatever cutting edge LLM isn't.

I can't prove you're not a stochastic parrot by only talking to you via text. But in person I can toss you an object and you can catch it which shows that you understand how to interact with a dynamic 3D environment. I can ask you a question about something in our shared environment, and you can give an answer which is _true_, rather than which is a plausible-sounding sentence. This is the difference between knowing what English texts or English conversations look like, versus knowing what states of the world are referred to by statements.
|
I'm not saying that current LLMs have derived human-level world models (they haven't). It's just that, to me, the theory that textual data is categorically not enough to do so is necessarily an empirical one. To back up the assertion, you'd need to construct metrics on which present text-only LLMs fail, and then show that multi-modal LLMs succeed on those same metrics. So far, I don't think adding multi-modality to LLMs has actually improved their general-purpose reasoning ability, which I consider evidence against this theory. But then I read people online just asserting it as though it's an obvious truth derivable from philosophical first principles. It's odd to me.