| LLMs are very good at uncovering the mathematical relationships between words, many layers deep. Calling that understanding is a claim about what understanding is. But because we know how the LLMs we're talking about at the moment are trained, that claim seems to have further problems: LLMs do not directly model the world; they train on and model what people write about the world. An LLM is an AI model of a computed gestalt human model of the world, rather than a model of the world directly. If you ask it a question, it tells you what it models someone else (a gestalt of human writing) is most likely to say. That in turn is strengthened if user interaction accepts it and corrected only if someone tells it something different. If we were to define that as what "understanding" is, we would equivalently be saying that a human bullshit artist would have expert understanding if only they produced more believable bullshit. (They also just "try to sound like an expert".) Likewise, I'm not convinced that we can measure its understanding just by identifying inaccuracies or measuring the difference between its answers and expert answers -
There would be no difference between bluffing your way through the interview (relying on your interviewer's limitations in how they interrogate you) and acing the interview. There seems to be a fundamental difference in levels of indirection. Where we "map the territory", LLMs "map the maps of the territory". It can be an arbitrarily good approximation, and practically very useful, but it's a strong ontological step to say one thing "is" another just because it can be used like it. |
This is true. But human brains don't directly model the world either; they form an internal model based on what comes in through their senses. Humans have the advantage of being more "multi-modal," but that doesn't mean that they get more information or better information.
Much of my "modeling of the world" comes from the fact that I've read a lot of text. But of course I haven't read even a tiny fraction of what GPT4 has.
That said, LLMs can already train on images, as GPT4-V does. The image generators do this as well; it's just a matter of time before the two are fully integrated. Later we'll see a lot more training on video and sound, with all of it integrated into a single model.