| Not really about Theory of Mind, but along the same lines: the other day someone argued with me that LLMs model the world, rather than just modelling language (which may represent the world). I kept thinking about that problem and about plausible experiments to show my point that LLMs are dumb about the physical world, even if they know perfectly how it works in terms of language/representation. So I thought: what happens if I give an LLM an image and ask for a representation of said image in ASCII art (obviously without relying on Python and the trivial pixel-intensity-to-character transform it usually proposes)? Remember: - LLMs should've been trained on a lot of RGB image data with associated captions => so they should understand images very well. - LLMs should've been trained on a lot of ASCII art with associated captions => so they should draw/write ASCII like an expert. Plus, they apparently understand vision (handled as tokens), so they should do well. But they can't produce a decent translation that captures the most interesting features of an image in ASCII art (I'm pretty sure a human with an hour to spare could do it, even if it's awful ASCII art). For example, I uploaded an image macro meme with text and two pictures of different people kind of looking at each other. The ASCII art representation just showed two faces that didn't look at each other but rather away from each other. The model just does not "understand" the concept of crossing gazes (even if it "understands" the language, and even the image patches when you ask where the people are looking, it will not draw that humanly important stuff by itself). These things just work with tokens, and that is useful and seems like magic in a lot of domains. But there is no way in hell we are going to get to AGI without a fully integrated sensor platform that can model the world in its totality, including interacting with it (i.e. like humans in training, but not necessarily in substrate nor, hopefully, in training time). And I really don't know how something that has a very partial model of the world can have a Theory of Mind. |
However, ask a Generative Adversarial Network for ASCII and you'll get what you expect. Absent the intra-word character cohesion that an LLM's tokenization provides, it will give realistic, if sometimes "uncanny", images: ones that "make sense" sequentially, or in the short term, but not in the longer or larger context.
The language portion of your brain, which works faster than you do - else you would constantly be at a loss for words - is not nearly as well equipped to deal with spatial problems as your posterior parietal cortex is.
Ultimately we are converging towards a Mixture-of-Experts model that we will one day realize is just... us, but better.
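For anyone wondering what the "trivial pixel intensity to character transform" from the quoted comment looks like, here is a minimal sketch. It assumes the image has already been decoded into a grid of grayscale integers (0-255), so no image library is involved; the names `RAMP` and `to_ascii` and the choice of character ramp are purely illustrative, not something from the original experiment.

```python
# Characters ordered from sparse to dense glyphs (an arbitrary, hypothetical ramp).
RAMP = " .:-=+*#%@"

def to_ascii(pixels, ramp=RAMP):
    """Map a 2D grid of grayscale values (0-255) to lines of ASCII characters."""
    last = len(ramp) - 1
    lines = []
    for row in pixels:
        # Scale each intensity linearly to an index into the ramp.
        lines.append("".join(ramp[value * last // 255] for value in row))
    return "\n".join(lines)

# A tiny made-up 3x4 "image": a high-intensity diagonal on a zero background.
image = [
    [255, 128, 0, 0],
    [0, 255, 128, 0],
    [0, 0, 255, 128],
]
print(to_ascii(image))
```

Note that this is a purely local, per-pixel mapping: nothing in it knows about faces or where anyone is looking, which is exactly why the comment rules it out as evidence of understanding.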