Hacker News
by abeppu 1107 days ago
I disagree. The criticism is _not_ that basic building blocks cannot be combined to produce something richer. The issue is the "without any reference to meaning" part of the quoted definition from Bender in that article. Models which are _only_ trained on text do not have a grounding to relate linguistic forms to anything else. When you know what an apple is, it's in part because you've seen and touched and tasted and eaten one. The model only knows how people talk about apples, and which texts are plausible, but not which ones are true.

But we're already getting past this with multi-modal models! Some really great work is being done which ties language processing with visual perception and in some cases robot action planning. A model can know how we talk about apples, can see where an apple is in a scene, can navigate to and retrieve an apple, etc. This lets us get at truth ("Is the claim 'the apple is on the book' true of this scene?") in a way which text-only models fundamentally cannot. The point is, getting past the "stochastic parrot" phase requires qualitative structural changes to incorporate different kinds of information -- not just scaling up text-only models.

> They can't prove I'm not a stochastic parrot anymore than they can prove whatever cutting edge LLM isn't.

I can't prove you're not a stochastic parrot by only talking to you via text. But in person I can toss you an object and you can catch it, which shows that you understand how to interact with a dynamic 3D environment. I can ask you a question about something in our shared environment, and you can give an answer which is _true_, rather than one which is merely a plausible-sounding sentence. This is the difference between knowing what English texts or English conversations look like, versus knowing what states of the world are referred to by statements.

3 comments

By your definition, is a blind person capable of reasoning about visual data? Is a deaf person capable of reasoning about auditory data? Can a physicist understand the molecules, atoms, & subatomic particles which he or she can only interact with via a fundamentally textual theory? I would submit that there's no fundamental reason why an LLM needs access to more than text to derive human-level world models.

I'm not saying that the current LLMs have derived human-level world models (they haven't). It's just that, to me, the theory that textual data is categorically not enough to do so is necessarily empirical. To back up the assertion, you'd need to construct metrics on which present text-only LLMs fail, and then show that multi-modal LLMs succeed on those same metrics. So far, I don't think adding multi-modality to LLMs actually has improved their general-purpose reasoning ability, which I consider evidence against this theory. But then I read people online just asserting it as though it's an obvious truth derivable from philosophical first principles. It's odd to me.

> I disagree. The criticism is _not_ that basic building blocks cannot be combined to produce something richer. The issue is the "without any reference to meaning" part of the quoted definition from Bender in that article.

Right. People think the stochastic parrot description is about the Chinese Room thought experiment, but it's not. It's about the Thai Library thought experiment: https://medium.com/@emilymenonbender/thought-experiment-in-t...

Thanks for pointing to that! I'm weirded out b/c this article from Bender in late May seemed so familiar. Here's a conversation from Feb in which a very similar argument is made, also using Thai text as an example: https://news.ycombinator.com/item?id=34732971

You can say the same about humans: we only experience an approximation of the real world via our senses, never the “real thing”, so can we “truly understand” it? Yes, in the sense that we can reason about it and make and test predictions about the parts we can understand. The world we experience is based on our senses, and that’s what we understand. An LLM’s world is text, and there’s no reason to think it “truly understands” the concepts that it’s using any less than humans do.