
I am going to tune the expression form of my definition to:

Understanding = Novel Scope * Suitability / Parameter Count.
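To make the expression concrete, here is a minimal sketch that treats it as a plain ratio. The function name, the units, and the numbers are all hypothetical, chosen only to show how the terms trade off; nothing here is a calibrated measure.

```python
def understanding_score(novel_scope: float, suitability: float, parameter_count: int) -> float:
    """Novel Scope * Suitability / Parameter Count, with all inputs in hypothetical units."""
    if parameter_count <= 0:
        raise ValueError("parameter_count must be positive")
    return novel_scope * suitability / parameter_count


# Two hypothetical models that respond equally well over the same novel scope,
# but one needs a thousand times more parameters to do it.
compact = understanding_score(novel_scope=100.0, suitability=0.9, parameter_count=1_000)
bloated = understanding_score(novel_scope=100.0, suitability=0.9, parameter_count=1_000_000)

print(compact > bloated)  # the compact model scores higher
```

With scope and suitability held equal, the score only rewards the smaller parameter budget, which is the comparison the expression is meant to capture.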

> My point is that you can have the same result with a representation that "closely matches the topology of the relationships being modelled". For example, a representation that "allows relationships between tokens but yet does not care about the meaning or concept not useful to form convincing sentences".

You are absolutely right that a lack of correspondence between internal representation and reality does not rule out real, convincing performance.

> GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation.

This is true of all learning. And it will always be the nature of learning.

Which is why performance is (or at least should be) always measured on novel input.

> High compactness does not equal best solution. Even humans don't used "high compactness" when doing basic arithmetic, but use "by heart multiplication table".

This is a really good point!

It brings up the two useful modes of human representation:

(1) The brain's slow mode is very good at handling deeper and deeper layers of representation. When thinking about arithmetic or more complex math analytically, our understanding does follow a path of increasingly deeper representations. And we are very good at applying these deeper understandings.

(2) Then, our fast mode creates shallow representations of things we do frequently.

I would look at this as (1) reflecting scalable understanding, and (2) reflecting very limited understanding but scalable speed.

And we often use both modes together.

I would argue that the understanding is primarily in the slow mode, and that the fast mode is the non-understanding but appropriate-response mode. It operates with a much reduced scope of appropriate response, but a high percentage of applicability. Meaning, most of the time we don't need deep understanding; we just need a fast, appropriate response.

But how to compare the two in scopes where they are equally accurate?

I think "high understanding" representations are those that can be used flexibly, in ways quite different from how they were learned.

Our slow mode does this very well. Our fast mode not so well, but to the degree it generalizes well to novel situations, that would be an increase in understanding.

Our fast system does generalize, but I would argue that at some point it fails, where our slower, deeper representations still provide the means of analyzing a situation. So the slow system clearly "understands" better.

It is interesting how quickly understanding from our analytical side translates into operation on our fast side. Clearly, our fast side has very efficient access to new "patterns" that our slow side constructs.

> If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just few parameters using a Bezier curve.

I love this example. It does contrast very different kinds of understanding.

(1) Understanding the fundamental reality in which paper planes exist,

(2) Vs. understanding how paper planes behave.
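The few-parameter description in the quoted example can be sketched as follows, assuming a 2D cubic Bezier curve. The control points are hypothetical, not measurements of a real flight; the point is only that eight numbers describe the whole path.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a 2D cubic Bezier curve at t in [0, 1], using the Bernstein form."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u * u * t, 3 * u * t * t, t**3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


# Hypothetical control points (horizontal distance, height) for one throw.
control_points = [(0.0, 1.5), (2.0, 2.5), (4.0, 1.0), (5.0, 0.0)]

# The entire trajectory is described by 4 points * 2 coordinates = 8 parameters.
parameter_count = sum(len(p) for p in control_points)

# Sample the curve at 11 evenly spaced values of t.
trajectory = [cubic_bezier(*control_points, t=i / 10) for i in range(11)]
print(parameter_count, trajectory[0], trajectory[-1])
```

This describes how one particular flight behaved. It says nothing about gravity or wind, so it cannot predict a different throw, which is the contrast between the two kinds of understanding.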

I think my expression works well here, as long as we take "scope" seriously.

Understanding = Novel Scope * Suitability / Parameter Count.

For paper planes as a hobby, a smaller neuron/parameter budget is achieved by learning the emergent laws of paper planes, not their underlying physics. And understanding paper planes is achieved with this smaller budget.

For understanding paper plane dynamics at a design level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an intuitive level.

For understanding paper plane dynamics at a world class competition level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an analytical level.

So these would be three different "understandings", each with their own scope and area of appropriate response to novel situations.

Point taken: The most fundamental correspondence isn't the point of a lot of understanding.

You are right, and my equation works, as long as "scope" is interpreted to mean the appropriate level of interest, not the area of fundamental physics involved. Great point.

Does that get us on the same page? Closer?

> I am saying that creating convincing sentences does not require understanding

As problem complexity goes up, there really is an explosive difference between appropriate response via "familiarity" or lower-level fit, vs. higher level fit, for the same number of parameters.

And it is also a dramatically bigger challenge for lower-level fits to respond well to novel stimuli, given the same number of parameters.

The reason is that complex problems operate in higher dimensional spaces, and relationships in higher dimensional spaces have exponentially more complexity for any level of representation. Exponentially.

Linear fits of a 2D Bezier are inefficient but work. Linear fits of a 100-dimensional Bezier, which isn't very many dimensions from a data standpoint, become ludicrously expensive in parameters.
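One way to make the scaling concrete is a rough count, assuming the "linear fit" is a piecewise-linear interpolant on a regular grid over a d-dimensional input, with one stored value per grid knot. That setup, the function name, and the choice of 10 knots per axis are assumptions for illustration, not something specified above.

```python
def grid_fit_parameter_count(dimensions: int, knots_per_axis: int = 10) -> int:
    """Stored values for a piecewise-linear fit on a regular grid: knots_per_axis ** dimensions."""
    if dimensions < 1 or knots_per_axis < 2:
        raise ValueError("need at least 1 dimension and 2 knots per axis")
    return knots_per_axis ** dimensions


# Parameter count at a fixed, modest resolution as dimensionality grows.
for d in (1, 2, 3, 10, 100):
    print(f"{d:>3} dimensions: {grid_fit_parameter_count(d):.3e} parameters")
```

Under these assumptions, two dimensions cost 100 parameters and 100 dimensions cost 10^100, even at a coarse resolution of 10 knots per axis.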

The dimensionality of human communication is probably the most complex problem ever tackled systematically.

I am trying to think of a way to capture this more concretely, i.e. a way to draw a line in this conversation that stands up on its own. All I can point to is the complete failure of any lower-level fit, when done directly, to achieve a trillionth of a trillionth of trillions of the flexibility that SOTA models demonstrate. The extreme dimensionality of the input that LLMs respond to makes my "trillionths" literal in this case. And we do get a concrete measure of the dimensionality within their capacity, as context windows give us live demonstrations of this.

Note that language is literally highly compressed information, with pervasive non-local interactions. The enormous dimensionality is compounded by dense reactivity and pervasive discontinuities. No other informational artifact compares to language in complexity.

When I say that this is a case where either real relationships are learned or the model fails, it is because the number of parameters needed for a lower-level fit really is beyond imagining.

You can't point to any lower-level fit, where the lower-level fit is basic to the fitting algorithm, that ever achieved even a tiny-grammar, tiny-subject-scope, toy-of-a-toy version of what LLMs are doing. Nor can I, despite following progress for decades. Nobody can. The original successes of the first LLMs, modest as they appear now, were completely unprecedented.

There just are not enough parameters, by many orders of magnitude, to do language justice over a context window, and respond sensibly to intentionally novel conversations, without identifying the actual relationships behind it.

So that would be my challenge to you: identify any verifiable lower-level fit that even approximates LLM behavior at the tiniest of toy levels. Verifiable fits at any given level are easy to produce. Just train a model where the basis is restricted to that kind of fit.

Otherwise, I can agree that understanding is a continuous property, and that judging how well something understands something, without strict, well thought out benchmarks, involves intuition and judgement. So there can be legitimate differences in how we perceive model understanding, in the absence of direct measures.

Any more thoughts? I have understood both my own points and yours better as we went along.