| 1. Okay with your definition. My point is that you can have the same result with a representation that "closely matches the topology of the relationships being modelled". For example, a representation that "allows relationships between tokens but yet does not care about the meaning or concept not useful to form convincing sentences". And therefore, it means that you can have convincing text without needing a "representation's topology closely matching the topology of the relationships being modelled", and therefore, according to your own definition: no understanding. 2. It is not true I'm avoiding that. I have answered very clearly. 1) GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation. This does not require a full world understanding. Worse, once convincing sentence generation is reached, there is no gain from getting a better world understanding: the training mechanism that pushes in the correct direction stops, and therefore it can go in any direction at all. 2) High compactness does not equal best solution. Even humans don't use "high compactness" when doing basic arithmetic, but use "by heart multiplication table". Being compact is useless if it comes with high complexity each time you need to recompute the output. 3) A very good approximation can reach higher compactness anyway. Your Bezier curve is a good example: real physical phenomena are almost never the result of a Bezier equation. A Bezier curve does not understand the phenomenon. When it comes to GenAI, it can "fit" reality with very close precision using several representations, but the majority of those representations correspond to an incorrect "understanding" of reality. Another example: if I throw a ball in the air, the motion will be, to first order, a quadratic equation, plus corrections due to friction, wind, ...
If I just "train" something for "throw a ball", this system may fit a quadratic function plus corrections, but it could achieve the same result with Bezier curves, or Fourier series, or additive Gaussians, or ...
But the "understanding" is that the ball is influenced by gravity, which leads to a quadratic equation. The system does not understand that. It has no reason to understand that. And it has no reason to prefer a quadratic equation fit over a Bezier fit; on the contrary, the Bezier fit will be more realistic (as the quadratic equation is just the first-order approximation). If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just a few parameters using a Bezier curve. Train on plenty of paper plane trajectories, and you will have a system that can give you a very realistic paper plane trajectory based on Bezier curves. And yet, your system has no understanding of the paper plane trajectory: it does not know which mechanisms make the paper plane go up or down. It just creates a realistic trajectory without knowing why this trajectory is realistic, just that this trajectory makes sense based on the other trajectories it has seen. 3. This argument seems to go against your thesis. You are saying that humans, who "understand" and are not even able to hold as much conversation as an LLM, have far too many neurons. What are these neurons even for, then? You are explaining that LLMs are just "something different", a reduced mini-version of a brain, and yet you are also saying that they are able to do the complex things the brain does. Another way of seeing it is that LLMs are "dropping" things that they don't need to create convincing sentences, such as "understanding the token". They just "get the Bezier curve fit of the relationship" instead of understanding the real mechanisms and concepts. 
It's like your Bezier curve example: a system that just creates a realistic paper plane trajectory based on "typical Bezier curve observed during training" will need far fewer "neurons" than a system that needs to understand the whole aerodynamics of the paper plane. 4. I argue this the same way I say that a system that describes a paper plane trajectory based on best Bezier curves does not understand the mechanism behind how a paper plane trajectory works.
I am not saying "I define 'understanding' as what humans do", I am saying that creating convincing sentences does not require understanding, the same way that generating realistic paper plane trajectories does not require understanding gravity, the Navier-Stokes equations and Brownian motion. The Bezier curve paper plane trajectory predictor system I have mentioned, do you think it has an understanding of gravity? Of Navier-Stokes? Of Brownian motion? No, it does not. You can open this system. It just has Bezier curves for plenty of examples, and thanks to that, it knows that one trajectory is realistic and another is unrealistic. And at some point, it is also able to give realistic trajectories in brand new situations it was never trained on. |
Understanding = Novel Scope * Suitability / Parameter Count.
> My point is that you can have the same result with a representation that "closely matches the topology of the relationships being modelled". For example, a representation that "allows relationships between tokens but yet does not care about the meaning or concept not useful to form convincing sentences".
You are absolutely right that a lack of correspondence between internal representation and reality does not rule out real/convincing performance.
> GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation.
This is true of all learning. And it will always be the nature of learning.
Which is why performance is (or should be) always measured on novel input.
> High compactness does not equal best solution. Even humans don't use "high compactness" when doing basic arithmetic, but use "by heart multiplication table".
This is a really good point!
It brings up the two useful modes of human representation:
(1) The brain's slow mode is very good at handling deeper and deeper layers of representation. When thinking about arithmetic or more complex math analytically, our understanding does follow a path of increasingly deeper representations. And we are very good at applying these deeper understandings.
(2) Then, our fast mode creates shallow representations of things we do frequently.
I would look at this as (1) reflecting scalable understanding, and (2) reflecting very limited understanding but scalable speed.
And we often use both modes together.
I would argue that the understanding is primarily in the slow mode, and that the fast mode is the non-understanding but appropriate-response mode. It operates with a much reduced scope of appropriate response, but a high percentage of applicability. Meaning, most of the time we don't need deep understanding; we just need a fast appropriate response.
But how to compare the two in scopes where they are equally accurate?
I think "high understanding" representations are those flexible enough to be used in ways quite different from how they were learned.
Our slow mode does this very well. Our fast mode not so well, but to the degree it generalizes well to novel situations, that would be an increase in understanding.
Our fast system does generalize, but I would argue that at some point it fails, where our slower deeper representations provide the means of analyzing a situation. So it clearly "understands" better.
It is interesting how quickly understanding from our analytical side translates into operation on our fast side. Clearly, our fast side has very efficient access to new "patterns" that our slow side constructs.
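A toy sketch of the two modes, using the multiplication table from the quote. The assumptions are mine and purely illustrative: the fast mode is a memorized 9x9 table, the slow mode is repeated addition, and the names `fast_mode` and `slow_mode` are made up for this example. It is not a claim about how the brain actually does either.

```python
# "By heart" table: memorized products for 1..9 x 1..9 (81 stored entries).
TABLE = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}


def fast_mode(a, b):
    """Shallow lookup: instant, but only defined where it was memorized."""
    return TABLE.get((a, b))  # None outside the memorized range


def slow_mode(a, b):
    """Rule-based: repeated addition, works for any non-negative integers."""
    total = 0
    for _ in range(b):
        total += a
    return total


familiar = (7, 8)   # inside the memorized table
novel = (12, 13)    # outside it

print(fast_mode(*familiar), slow_mode(*familiar))  # 56 56
print(fast_mode(*novel), slow_mode(*novel))        # None 156
```

On familiar input the two are equally accurate and the lookup is cheaper per answer. The difference only shows up on novel input, which is where I would locate the understanding.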
> If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just few parameters using a Bezier curve.
I love this example. It does contrast very different kinds of understanding.
(1) Understanding the fundamental reality in which paper planes exist,
(2) Vs. understanding how paper planes behave.
I think my expression works well here, as long as we take "scope" seriously.
Understanding = Novel Scope * Suitability / Parameter Count.
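Read literally as a ratio, the expression could be sketched like this. The function name, the units, and every number below are placeholders I am assuming for illustration; nothing here is a measurement.

```python
def understanding_score(novel_scope: float, suitability: float, parameter_count: int) -> float:
    """Novel Scope * Suitability / Parameter Count, taken at face value."""
    if parameter_count <= 0:
        raise ValueError("parameter_count must be positive")
    return novel_scope * suitability / parameter_count


# Two hypothetical models with the same scope and the same suitability,
# differing only in how many parameters they spend to get there.
compact = understanding_score(novel_scope=1.0, suitability=0.9, parameter_count=10)
bulky = understanding_score(novel_scope=1.0, suitability=0.9, parameter_count=10_000)

print(compact > bulky)  # True
```

The only thing the sketch shows is the direction of the trade-off: for a fixed scope and suitability, the representation that needs fewer parameters scores higher.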
For paper planes as a hobby, a smaller neuron/parameter budget is achieved by learning the emergent laws of paper planes, not their underlying physics. And understanding paper planes is achieved with this smaller budget.
For understanding paper plane dynamics at a design level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an intuitive level.
For understanding paper plane dynamics at a world class competition level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an analytical level.
So these would be three different "understandings", each with their own scope and area of appropriate response to novel situations.
Point taken: The most fundamental correspondence isn't the point of a lot of understanding.
You are right, and my equation works, as long as "scope" is interpreted to mean the appropriate level of interest, not the area of fundamental physics involved. Great point.
Does that get us on the same page? Closer?
> I am saying that creating convincing sentences does not require understanding
As problem complexity goes up, there really is an explosive difference between appropriate response via "familiarity" or lower-level fit, vs. higher level fit, for the same number of parameters.
And it is also a dramatically bigger challenge for lower-level fits to respond well to novel stimuli, given the same number of parameters.
The reason is that complex problems operate in higher-dimensional spaces, and relationships in higher-dimensional spaces have exponentially more complexity for any level of representation. Exponentially.
Linear fits of a 2D Bezier are inefficient but work. Linear fits for a 100-dimensional Bezier, which isn't very many dimensions from a data standpoint, become ludicrously expensive in parameters.
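A back-of-the-envelope sketch of that blow-up, under an assumption I am adding: the low-level fit is piecewise linear on a regular grid with the same number of knots along every axis, so it stores one value per grid point. The function name and the choice of 10 knots are hypothetical.

```python
import math


def grid_fit_parameters(dims: int, knots_per_axis: int) -> int:
    """Stored values for a piecewise-linear fit on a regular grid: knots ** dims."""
    return knots_per_axis ** dims


for dims in (2, 3, 10, 100):
    count = grid_fit_parameters(dims, knots_per_axis=10)
    print(f"{dims:>3} dims: about 10^{round(math.log10(count))} stored values")
```

With 10 knots per axis, two dimensions cost 100 values, while 100 dimensions cost 10^100. Smarter low-level bases do better than a regular grid, but the count still grows exponentially with dimension unless the fit exploits real structure in the data.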
The dimensionality of human communication is probably the most complex problem ever tackled systematically.
I am trying to think of a way to capture this more concretely. I.e. a way to draw a line in this conversation that stands up on its own. All I can point to is the complete failure of any lower-level fit, when done directly, to achieve a trillionth of a trillionth of trillions of the flexibility that SOTA models demonstrate. The extreme dimensionality of input that LLMs respond to makes my "trillionths" literal in this case. And we do get a concrete measure of the dimensionality within their capacity, as context windows give us live demonstrations of this.
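To put a rough number on the input space, here is a sketch that counts distinct token sequences of a fixed length. The vocabulary size of 50,000 and the context length of 1,000 are round figures I am assuming for illustration, not the specification of any particular model, and the function name is made up.

```python
import math


def log10_sequence_count(vocab_size: int, context_length: int) -> float:
    """log10 of vocab_size ** context_length, the number of distinct sequences."""
    return context_length * math.log10(vocab_size)


exponent = log10_sequence_count(vocab_size=50_000, context_length=1_000)
print(f"about 10^{exponent:.0f} distinct inputs")  # about 10^4699 distinct inputs
```

A trillion is 10^12, so a trillionth of a trillionth of a trillionth is 10^-36. Against an input space on the order of 10^4699, even under these modest assumed sizes, that fraction barely registers. Most of those sequences are gibberish, of course, but the sensible subset is still far too large to cover by familiarity alone.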
Note that language is literally highly compressed information, with pervasive non-local interactions. The enormous dimensionality is compounded by dense reactivity, pervasive discontinuities. No other informational artifact compares to language complexity.
When I say that this is a case where either real relationships are learned or the model fails, it is because the number of parameters for a lower-level fit really is beyond imagining.
You can't point to any lower-level fit, where the lower-level fit is basic to the fitting algorithm, that ever achieved even a tiny-grammar, tiny-subject-scope, toy-of-a-toy version of what LLMs are doing. Nor can I, despite following progress for decades. Nobody can. The original successes of the first LLMs, modest as they appear now, were completely unprecedented.
There just are not enough parameters, by many orders of magnitude, to do language justice over a context window, and respond sensibly to intentionally novel conversations, without identifying the actual relationships behind it.
So that would be my challenge to you: identify any verifiable lower-level fit that even approximates LLM behavior at the tiniest of toy levels. Verifiable fits at any given level are easy to do; just train a model where the basis is restricted to that kind of fit.
Otherwise, I can agree that understanding is a continuous property, and that how well something understands something, without strict benchmarking by well thought out benchmarks, involves intuition and judgement. So there can be legitimate differences in how we perceive model understanding, in the absence of direct measures.
Any more thoughts? I have understood both my own points and yours better as we went along.