Hacker News
by ambrozk 1097 days ago
By your definition, is a blind person capable of reasoning about visual data? Is a deaf person capable of reasoning about auditory data? Can a physicist understand the molecules, atoms, and subatomic particles that he or she can only interact with via a fundamentally textual theory? I would submit that there's no fundamental reason why an LLM needs access to more than text to derive human-level world models.

I'm not saying that current LLMs have derived human-level world models (they haven't). It's just that, to me, the theory that textual data is categorically not enough to do so is necessarily an empirical claim. To back it up, you'd need to construct metrics on which present text-only LLMs fail, and then show that multi-modal LLMs succeed on those same metrics. So far, I don't think adding multi-modality to LLMs has actually improved their general-purpose reasoning ability, which I consider evidence against this theory. But then I read people online just asserting it as though it's an obvious truth derivable from philosophical first principles. It's odd to me.