by Nevermark 9 days ago

I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

I just started out with mapping to be systematic. Mapping is ground zero, then interpolation, i.e. any smooth fitting function or basis, then combinatorial, where different bases are recognized and then projected relative to their relevance to a new input.

Each of those increases modeling efficiency and power, but even combinatorial doesn't scale to problems like language.

I may be doing a poor job communicating. A formal breakdown of the scaling issues with lower-order modeling that is scaled up to compensate would make a great paper.

To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

And consider the other side. We have no idea how our own brains are lifting up what is relevant vs. what is not. We are used to it happening. We call it "understanding". But we don't know how it works, how we work. Despite experiencing it.

What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.

1 comments

cauch 9 days ago

> I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

An LLM works by creating a space of N dimensions, N being the number of tokens. This space contains all the possible combinations. The LLM finds the best combination without scanning the whole space: it minimizes the loss function, which is low when the output corresponds to the target. By doing so, it never explores the combinations that "go in the wrong direction", so it is not true that scaling the space by a factor S scales the difficulty of running the model by the same factor S.

Because of that, while the combination space scales like combinatorial, the model does not. A model with 2 weights (or rather tokens, but the number of weights should be at least the number of tokens) corresponds to 4 combinations (AA, AB, BA, BB can indeed be described by 2 weights, each of value "A" or "B"). A model with 3 weights corresponds to 27 combinations. A model with 4 weights corresponds to 256 combinations. ... A model with N weights corresponds to N to the power N combinations. The number of combinations increases enormously, and yet the number of weights increases only linearly.
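A minimal sketch of that counting argument, taking the N-to-the-power-N rule at face value (N symbols in N positions). The function name is illustrative only; nothing here describes how a real LLM is parameterized.

```python
def combinations(n: int) -> int:
    """Number of length-n sequences over n symbols: n to the power n."""
    return n ** n

# The weight count grows linearly with n; the combination count does not.
for n in (2, 3, 4, 10):
    print(f"weights={n:>2}  combinations={combinations(n)}")
```

The loop only makes the contrast visible: each extra weight multiplies the number of describable combinations instead of adding to it.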

In SOTA models, we have billions of weights. Such a model contains a very, very big number of combinations, something so big that it is difficult for a human to grasp. It will not try all of these combinations one by one: the gradient descent method helps it find the best combination without having to do so.

So, yes, SOTA models are finding "the best combination" amongst an impressively huge number of combinations, yet without having to "scale like combinatorial".

> To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

Yes. Easy. A SOTA LLM does that. It is modeling without understanding. It does not understand, it finds the best patterns. And when you put it in a new situation, it uses these patterns to create a new text, without truly understanding the content of the text. And if you ask an additional question, it will use the previous text as context, and create a new text that, as it has been trained to, will be consistent with the output that has been given.

Your assertion "you can prove me wrong" is circular reasoning: you start by saying "if a model can produce a text that looks realistic to me, then it means it has understanding. To prove me wrong, give me a text that looks realistic to me and has no understanding". Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.

> If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

The combination space grows as N to the power N. So a trillion parameters is not "just 1000 times bigger" than a billion parameters, but more than 1000 to the power of one billion times bigger (the exact value is even bigger than that). Do you realise the size of the combination space? That is 1 followed by 3 times one billion zeroes.
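The numbers involved are too large to print, but their base-10 exponents are not. A hedged check of the arithmetic, assuming the N-to-the-power-N toy model from the comment (the helper name is made up for this sketch):

```python
import math

def log10_space(n: int) -> float:
    """log10 of n**n, the combination-space size in the toy model."""
    return n * math.log10(n)

billion, trillion = 10**9, 10**12

# Exponent of the ratio trillion**trillion / billion**billion.
log10_ratio = log10_space(trillion) - log10_space(billion)
# Exponent of 1000**billion, the lower bound quoted in the comment.
log10_bound = billion * math.log10(1000)

print(f"ratio is about 10^{log10_ratio:.4g}, quoted bound is 10^{log10_bound:.4g}")
```

Under these assumptions the quoted bound (a 1 followed by three billion zeroes) is indeed far below the actual ratio, which is what "more than" in the comment requires.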

> What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.

I think you don't understand how LLMs work: they find the best combinations in an incredibly huge parameter space, but don't need to explore the whole space, just the 1-dimensional manifold that is the curve followed by gradient descent within this huge combination space.
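A toy illustration of that point, under strong assumptions: the loss is a separable convex bowl, which real LLM losses are not, and the learning rate and tolerance are arbitrary choices. It only shows that the number of descent steps need not grow with the size of the search space.

```python
def grad_descent_steps(dim: int, lr: float = 0.25, tol: float = 1e-6) -> int:
    """Minimise sum((w_i - 1)^2) starting from w = 0; return the steps taken."""
    w = [0.0] * dim
    steps = 0
    while max(abs(wi - 1.0) for wi in w) > tol:
        # The gradient of (w_i - 1)^2 is 2 * (w_i - 1), independent per coordinate.
        w = [wi - lr * 2.0 * (wi - 1.0) for wi in w]
        steps += 1
    return steps

for dim in (2, 10, 100):
    # An exhaustive search over 10 candidate values per axis would visit 10**dim points.
    print(f"dim={dim:>3}  descent steps={grad_descent_steps(dim)}  grid points=10^{dim}")
```

In this toy the step count is the same for every dimension while the exhaustive grid grows exponentially. Whether following such a path amounts to "understanding" is exactly what the thread disputes; the sketch does not settle it.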

There are plenty of clues that SOTA models don't "understand". For example, did you notice that SOTA models happen to understand what humans understand, and don't understand what humans don't understand? If SOTA models really worked by "discovering the true mechanism", they would discover with equal probability mechanisms that humans have already noticed and mechanisms that humans have not noticed yet. For example, humans know that the Standard Model of particle physics is incomplete, and there are plenty of texts and books about that which the SOTA models have learnt from. Yet SOTA models did not "understand" the underlying mechanism that explains particle physics. A model does not really know what an electron is by "making sense of what this object does"; it only knows it as "a language word that can be used in some context in a specific way".

And, sure, SOTA models are helping with new discoveries, but they do so by using a "reasoning" approach. If SOTA models really created their own understanding when learning human language, they should have the new discovery right after training, without using any "reasoning" approach, because it would be something they had already understood.

Nevermark 8 days ago

> Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.

Yes, if it consistently produces good output for highly varied stimuli that can be intentionally picked to be unlikely ever to have had obvious representation in the training set, then yes it understands.

I think we are talking past each other a bit.

A series of increasingly challenging datasets, used to capture scaling efficiencies, would ground our discussion.

But the level of performance for models is simply too good vs. the number of parameters to be doing anything trivial.

Deep learning models do something combinatorial models do not. The linear tensor + non-linear transforms do two special things:

1. The tensor itself just projects a linear space into higher dimensions, but it's still the same information space. Project a 2D surface into higher dimensions linearly, and there can be more parameters, but it is not more information, since there is an expansion of linear dependence to match.

2a. But then the nonlinearity thresholds, squashes or otherwise alters the linear results, in a way that removes linear dependencies, increasing the useful dimensionality of the representation.

2b. And the squashing also allows dimensions to be folded down.

So by both expanding and flattening representational dimensions, deep learning models are able to model higher-order relationships directly, which any less expressive modeling could only approximate by cobbling together many patches of fitting.
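Points 1 and 2a can be checked on a toy case. The sketch below uses matrix rank as a rough stand-in for "useful dimensionality", which is an assumption of this example and not a claim from the comment. The six projection directions are hand-picked, hypothetical weights, and point 2b (folding dimensions down) is not demonstrated.

```python
def rank(rows, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = max(range(r, len(m)), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def relu(v):
    return max(0.0, v)

# Six fixed projection directions (hypothetical weights, chosen by hand).
W = [(1, 0), (0, 1), (1, 1), (1, -1), (1, 2), (2, 1)]
# Inputs: a 5x5 grid of points on a 2D plane.
points = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]

linear = [[a * x + b * y for a, b in W] for x, y in points]
nonlinear = [[relu(v) for v in row] for row in linear]

print("rank after linear projection:", rank(linear))
print("rank after ReLU:", rank(nonlinear))
```

The linear projection into six dimensions keeps the rank of the original 2D data, while applying the nonlinearity makes all six feature columns linearly independent on this grid.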

Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.

So a dramatically greater ability to "understand" is why deep learning models are so much better. They are not doing simple combinatorial fitting.

"Understanding" or not, combinatorial relationships are the low bar for deep learning models; they are inherently great at learning much higher-order relationships.

I am falling asleep at this point. I feel like we need a blackboard and a computer. You are saying a lot of things that make me think, and make sense to me.

cauch 8 days ago

Yes, this conversation is useless.

You keep saying "what I observe with GenAI can only be the result of 'understanding'" without providing any proof at all. Just a few beliefs.

You just say "look at this behavior, that's the proof". I truly don't think it is: nothing proves that this behavior requires 'understanding'. And nothing you provided helps: all you provided are impressive behaviors and then the unsubstantiated conclusions "and this behavior can only be done with understanding".

At the same time, there are too many clues showing that such behavior does not require understanding, even if it _looks_ incredibly clever:

1. GenAI does not understand (after the training phase) things that humans don't understand. If GenAI had the capacity to build an understanding during training, then there is no reason this understanding would coincide with human understanding.

2. Optimisation does not always lead to "understanding". Human brains choose to optimise "learning multiplication table by heart" rather than building a pocket calculator inside the neurons.

3. Human brains, which have "understanding", work fundamentally differently from GenAI (flow of thoughts, intrinsically intertwined memory and compute, optimised for world-model treatment rather than token treatment, ...). It is an unsubstantiated jump to simply conclude AI has "understanding", when the behavior can be the result of fundamental differences.

4. "Basic" LLMs are surprisingly good at creating convincing sentences, and yet there are situations where it is blatantly clear they did not understand anything. More advanced SOTA models are refinements of the "basic LLM", and therefore the "sentence construction that is done without understanding" is still used, and prevents the SOTA model from building a full understanding.

> Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.

It's exactly what I'm saying: deep learning models are very good at learning complex relationships. Such as "I don't know what 'Paris' is, I don't have any understanding of what a city is in reality, but when the token Paris is associated with these other tokens in this complex order, even if I never saw it before, I have learnt the complex relationships and therefore I'm able to build a series of tokens".

They are very good at learning complex relationships that allow them to choose the correct combination even if they did not "understand" the content of the correct combination.

I understand that it is impressive: those relationships are very complex and very numerous (there are billions of them). It is easier to do anthropomorphism and conclude that the AI has "understood".

But again, the main problem is that you just claim, without any proof, "no, I cannot believe that, I refuse to believe that".

(and, by the way, I personally think that AI (SOTA but also even "basic LLM") do have 'rules' that correspond to some kind of understanding of basic mechanisms. I think they have basic "world models". But these world models are optimised "to write text" rather than to "understand the world", and therefore the large majority of AI output is just not-understood token chains)

Nevermark 8 days ago

Apologies. Your pushback (frustration and patience) has helped me crystallize my view, thank you.

1. Define understanding.

My definition isn't vague: "a compact representation enabled because that representation's topology closely matches the topology of the relationships being modeled."

Understanding = Scope and Suitability of Behavior / # Parameters.

Useful property: This definition applies across all scales: Scientists and mathematicians increase our understanding, every time patchworks of relationships get replaced with a simpler underlying insight.

Another useful property: It distinguishes between better understanding and having more facts. Facts improve performance but do not (non-trivially) decrease parameters.

What is your definition? In measurable terms?

2. You keep avoiding a basic aspect of modeling:

Higher compactness is achieved by higher representation correspondence between a model and the modeled.

Yes, lower-level representations can work. Even well, without good "understanding". But not as compactly. And as problem complexity grows, the relative difference in parameter budgets between high-correspondence and low-correspondence representations explodes.

This is not a subtle effect.

The hallmark of lower-level fitting is the far greater number of parameters required.

Dead simple example: Piece-wise linear vs. polynomial fitting of Bezier curves. Accuracy / parameter is far greater for the latter, because the representation matches the relationships being modeled.

That is an intentionally trivial example, but the same relationship holds for any problem.
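A small sketch of the Bezier comparison, assuming one coordinate of a cubic Bezier curve and made-up control values. The polynomial (Bernstein) form reproduces the curve exactly with 4 numbers, so only the piecewise-linear error is measured.

```python
def bezier(ctrl, t):
    """Evaluate one coordinate of a cubic Bezier curve (Bernstein form)."""
    p0, p1, p2, p3 = ctrl
    s = 1.0 - t
    return s**3 * p0 + 3 * s**2 * t * p1 + 3 * s * t**2 * p2 + t**3 * p3

def piecewise_linear_error(ctrl, n_knots, n_samples=1000):
    """Max error when the curve is replaced by straight segments between n_knots samples."""
    knots = [i / (n_knots - 1) for i in range(n_knots)]
    values = [bezier(ctrl, t) for t in knots]
    worst = 0.0
    for j in range(n_samples + 1):
        t = j / n_samples
        i = min(int(t * (n_knots - 1)), n_knots - 2)  # index of the segment containing t
        t0, t1 = knots[i], knots[i + 1]
        w = (t - t0) / (t1 - t0)
        approx = (1 - w) * values[i] + w * values[i + 1]
        worst = max(worst, abs(approx - bezier(ctrl, t)))
    return worst

ctrl = (0.0, 2.0, -1.0, 1.0)  # hypothetical control values
for n_knots in (4, 8, 16, 32, 64):
    print(f"knots={n_knots:>2}  max error={piecewise_linear_error(ctrl, n_knots):.6f}")
```

With the same 4 parameters the piecewise-linear fit is visibly off, and it needs many more knots to approach the accuracy the matching basis gets for free. This is a one-dimensional toy and says nothing by itself about language models.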

You keep avoiding that.

3. Today's LLMs are very compact compared to humans.

Compressing the substance of a corpus of global human writing into less than 1% of a single human's parameter space is compact.

Humans have 100–200 trillion, some people think 500 trillion, synapses.

How do you argue that behavior scope and suitability / parameters is not remarkable, when it is remarkable compared to any specific human you could point to?

No human can converse reasonably across the scope of global communication. But these models can. For <1% of a human's parameter budget.
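The "<1%" figure can be checked with the synapse estimates quoted above. Two assumptions are baked in and come from the thread, not from measurement: a model size of one trillion parameters (the figure used earlier in the discussion) and treating one synapse as one parameter.

```python
def parameter_share(model_params: float, synapses: float) -> float:
    """Model parameter count as a fraction of a synapse count."""
    return model_params / synapses

model_params = 1e12  # assumption: a one-trillion-parameter model
synapse_estimates = {"low": 100e12, "high": 200e12, "upper": 500e12}

for name, synapses in synapse_estimates.items():
    print(f"{name:>5} estimate: {parameter_share(model_params, synapses):.2%}")
```

The share works out to 1% at 100 trillion synapses and 0.5% at 200 trillion, so "less than 1%" holds for every estimate above the lowest one.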

4. Finally, based on your clear definition, how do you argue that humans understand but models do not? Saying we are different is a copout. Defining understanding as us vs. other is both circular and unenlightening. And ignores the real progress models are clearly making relative to humans.

Nevermark 7 days ago

Is that more coherent?

cauch 7 days ago

1. Okay with your definition.

My point is that you can have the same result with a representation that "closely matches the topology of the relationships being modelled". For example, a representation that "allows relationships between tokens but yet does not care about the meaning or concept not useful to form convincing sentences".

And therefore, it means that you can have convincing text without needing a "representation's topology closely matching the topology of the relationships being modelled", and therefore, according to your own definition: no understanding.

2. It is not true I'm avoiding that. I have answered very clearly.

1) GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation. This does not require a full world understanding. Worse, once a convincing sentence generation is reached, there is no gain by getting a better world understanding: the training mechanism that pushes into the correct direction stops and therefore it can go into any direction at all.

2) High compactness does not equal best solution. Even humans don't use "high compactness" when doing basic arithmetic, but use "by heart multiplication table". Being compact is useless if it comes with high complexity each time you need to recompute the output.

3) A very, very good approximation can reach higher compactness anyway. Your Bezier curve example is a good one: real physical phenomena are almost never the result of a Bezier equation. A Bezier curve did not understand the phenomenon. When it comes to GenAI, it can "fit" reality with very close precision using several representations, but the majority of those representations correspond to an incorrect "understanding" of reality.

Another example: if I throw a ball in the air, the motion will be, to first order, a quadratic equation, plus corrections due to friction, wind, ... If I just "train" something for "throw a ball", this system may fit a quadratic function plus corrections, but it can achieve the same result with Bezier curves, or Fourier series, or sums of Gaussians, or ... But the "understanding" is that the ball is influenced by gravity, which leads to a quadratic equation. The system does not understand that. It has no reason to understand that. And it has no reason to prefer a quadratic equation fit over a Bezier fit; on the contrary, the Bezier fit will be more realistic (as the quadratic equation is just the first-order approximation).
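For the frictionless first-order case, the two descriptions can be compared directly: a quadratic Bezier curve is itself a parabola, so the same motion is captured by three numbers in either basis. A minimal sketch, with illustrative values for launch speed, gravity and duration; the friction and wind corrections mentioned above are not modeled.

```python
def height(t, v=10.0, g=9.81):
    """Frictionless vertical motion: y = v*t - g*t^2/2."""
    return v * t - 0.5 * g * t * t

def bezier_controls(v=10.0, g=9.81, duration=2.0):
    """Control values of the quadratic Bezier equal to height() on [0, duration]."""
    a1 = v * duration            # linear coefficient in s = t / duration
    a2 = -0.5 * g * duration**2  # quadratic coefficient in s
    return (0.0, a1 / 2.0, a1 + a2)

def bezier_height(s, ctrl):
    """Quadratic Bezier in Bernstein form, s in [0, 1]."""
    b0, b1, b2 = ctrl
    return (1 - s) ** 2 * b0 + 2 * s * (1 - s) * b1 + s**2 * b2

ctrl = bezier_controls()
max_gap = max(abs(height(2.0 * i / 100) - bezier_height(i / 100, ctrl)) for i in range(101))
print("control values:", ctrl)
print("largest difference between the two descriptions:", max_gap)
```

The gap is at floating-point noise level, which supports the narrow claim that the fit alone cannot tell the two representations apart. It does not show which one, if either, counts as "understanding" gravity.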

If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just a few parameters using a Bezier curve. Train on plenty of paper plane trajectories, and you will have a system that can give you a very realistic paper plane trajectory based on Bezier curves. And yet, your system has no understanding of the paper plane trajectory: it does not know what mechanisms make the paper plane go up or down. It just creates a realistic trajectory without knowing why this trajectory is realistic, just that this trajectory makes sense based on the other trajectories it has seen.

3. This argument seems to go against your thesis. You are saying that humans, who "understand" yet are not even able to converse as broadly as an LLM, have far more neurons than needed. What are these neurons even for then? You are explaining that LLMs are just "something different", a reduced mini-version of a brain, and yet you are also saying that they are able to do the complex things the brain does.

Another way of seeing it, is that LLM are "dropping" things that they don't need to create convincing sentences, such as "understanding the token". They just "get the Bezier curve fit of the relationship" instead of understanding the real mechanisms and concepts.

It's like your Bezier curve example: a system that just creates a realistic paper plane trajectory based on "typical Bezier curve observed during training" will need way fewer "neurons" than a system that needs to understand the whole aerodynamics of the paper plane.

4. I argue this the same way I say that a system that describes a paper plane trajectory based on best-fit Bezier curves did not understand the mechanism behind how a paper plane trajectory works. I am not saying "I define 'understanding' as what humans do"; I am saying that creating convincing sentences does not require understanding, the same way that generating realistic paper plane trajectories does not require understanding gravity, the Navier-Stokes equations and Brownian motion.

The Bezier curve paper plane trajectory predictor system I have mentioned, do you think it has an understanding of gravity? Of Navier-Stokes? Of Brownian motion?

No, it has not. You can open this system. It just has Bezier curves for plenty of examples, and thanks to that, it knows that one trajectory is realistic and another is unrealistic. And at some point, it is also able to give realistic trajectories in brand new situations it has never trained on.

Nevermark 7 days ago

I am going to tune the expression form of my definition to:

Understanding = Novel Scope * Suitability / Parameter Count.

> My point is that you can have the same result with a representation that "closely matches the topology of the relationships being modelled". For example, a representation that "allows relationships between tokens but yet does not care about the meaning or concept not useful to form convincing sentences".

You are absolutely right, that lack of internal representation-reality correspondence does not rule out real/convincing performance.

> GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation.

This is true of all learning. And it will always be the nature of learning.

Which is why performance is always (should be) measured on novel input.

> High compactness does not equal best solution. Even humans don't use "high compactness" when doing basic arithmetic, but use "by heart multiplication table".

This is a really good point!

It brings up the two useful modes of human representation:

(1) The brain's slow mode is very good at handling deeper and deeper layers of representation. When thinking about arithmetic or more complex math analytically, our understanding does follow a path of increasingly deeper representations. And we are very good at applying these deeper understandings.

(2) Then, our fast mode creates shallow representations of things we do frequently.

I would look at this as (1) reflecting scalable understanding (2) reflecting very limited understanding, but scalable speed.

And we often use both modes together.

I would argue that the understanding is primarily in the slow mode. That the fast mode is the non-understanding but appropriate response mode. And that it operates with a much reduced scope of appropriate response, but a high percentage of applicability. Meaning, most of the time we don't need to use deep understanding; we just need a fast appropriate response.

But how to compare the two in scopes where they are equally accurate?

I think "high understanding" representations are those very flexible to being used in ways quite different from how they were learned.

Our slow mode does this very well. Our fast mode not so well, but to the degree it generalizes well to novel situations, that would be an increase in understanding.

Our fast system does generalize, but I would argue that at some point it fails, where our slower deeper representations provide the means of analyzing a situation. So it clearly "understands" better.

It is interesting how quickly understanding from our analytical side translates into operation on our fast side. Clearly, our fast side has very efficient access to new "patterns" that our slow side constructs.

> If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just a few parameters using a Bezier curve.

I love this example. It does contrast very different kinds of understanding.

(1) Understanding the fundamental reality in which paper planes exist,

(2) Vs. understanding how paper planes behave.

I think my expression works well here, as long as we take "scope" seriously.

Understanding = Novel Scope * Suitability / Parameter Count.

For paper planes as a hobby, a smaller neuron/parameter budget is achieved by learning the emergent laws of paper planes, not their underlying physics. And understanding paper planes is achieved with this smaller budget.

For understanding paper plane dynamics at a design level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an intuitive level.

For understanding paper plane dynamics at a world class competition level, a smaller neuron/parameter budget is achieved by learning the underlying physics of aerodynamics at an analytical level.

So these would be three different "understandings", each with their own scope and area of appropriate response to novel situations.

Point taken: The most fundamental correspondence isn't the point of a lot of understanding.

You are right, and my equation works, as long as "scope" is interpreted to mean appropriate level of interest, not area of fundamental physics involved. Great point.

Does that get us on the same page? Closer?

> I am saying that creating convincing sentences does not require understanding

As problem complexity goes up, there really is an explosive difference between appropriate response via "familiarity" or lower-level fit, vs. higher level fit, for the same number of parameters.

And it is also a dramatically bigger challenge for lower-level fits to respond well to novel stimuli, given the same number of parameters.

The reason is that complex problems operate in higher-dimensional spaces, and relationships in higher-dimensional spaces have exponentially more complexity for any level of representation. Exponentially.

Linear fits of a 2D Bezier are inefficient but work. Linear fits for a 100-dimensional Bezier, which isn't very many dimensions from a data standpoint, become ludicrously expensive in parameters.
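One way to put numbers on this, assuming the lower-level fit is a regular grid with one linear patch per cell and 10 segments per axis. Both are choices made for this sketch; adaptive or sparse schemes would need fewer patches.

```python
import math

def grid_cells(dims: int, segments_per_axis: int) -> int:
    """Cells in a regular grid with the given number of segments along each axis."""
    return segments_per_axis ** dims

for dims in (1, 2, 10, 100):
    cells = grid_cells(dims, 10)
    print(f"dims={dims:>3}  about 10^{round(math.log10(cells))} linear patches")
```

Under these assumptions the 2D case needs a hundred patches, while the 100-dimensional case needs a googol of them, far beyond any parameter budget discussed in this thread.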

The dimensionality of human communication is probably the most complex problem ever tackled systematically.

I am trying to think of a way to capture this more concretely, i.e. a way to draw a line in this conversation that stands up on its own. All I can point to is the complete failure of any lower-level fit, when done directly, to achieve a trillionth of a trillionth of trillions of the flexibility that SOTA models demonstrate. The extreme dimensionality of input that LLMs respond to makes my "trillionths" literal in this case. And we do get a concrete measure of the dimensionality within their capacity, as context windows give us live demonstrations of this.

Note that language is literally highly compressed information, with pervasive non-local interactions. The enormous dimensionality is compounded by dense reactivity, pervasive discontinuities. No other informational artifact compares to language complexity.

When I say that this is a case where either real relationships are learned or the model fails, it is because the number of parameters for a lower-level fit really is beyond imagining.

You can't point to any lower-level fit, where the lower-level fit is basic to the fitting algorithm, that ever achieved even a tiny-grammar, tiny-subject-scope, toy of a toy version of what LLMs are doing. Nor can I, despite following progress for decades. Nobody can. The original successes of the first LLMs, modest as they appear now, were completely unprecedented.

There just are not enough parameters, by many orders of magnitude, to do language justice over a context window, and respond sensibly to intentionally novel conversations, without identifying the actual relationships behind it.
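A rough sketch of that budget argument for the most naive lower-level fit, a pure lookup table with one entry per possible context. The vocabulary size and context lengths are assumed, illustrative values and are not taken from any specific model. The count is an upper bound: most contexts never occur, and practical n-gram models store far fewer entries.

```python
import math

def lookup_table_log10_entries(vocab_size: int, context_tokens: int) -> float:
    """log10 of the number of distinct contexts a pure lookup table must index."""
    return context_tokens * math.log10(vocab_size)

vocab_size = 50_000                 # assumed vocabulary size
budget_exponent = math.log10(10**12)  # a trillion parameters is 10^12

for context_tokens in (1, 2, 3, 5, 1000):
    exponent = lookup_table_log10_entries(vocab_size, context_tokens)
    fits = "fits" if exponent <= budget_exponent else "exceeds"
    print(f"context={context_tokens:>4}  about 10^{exponent:.0f} contexts  ({fits} a trillion entries)")
```

With these numbers a trillion-entry table is exhausted after a context of two tokens, which is the scale gap the comment is pointing at. It does not by itself show what a trained network stores instead.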

So that would be my challenge to you. To identify any verifiable lower-level fit that even approximates LLM behavior at the tiniest of toy levels. Verifiable fits at any given level are easy to do, just train a model where the basis is restricted to that kind of fit.

Otherwise, I can agree that understanding is a continuous property, and that how well something understands something, without strict benchmarking by well thought out benchmarks, involves intuition and judgement. So there can be legitimate differences in how we perceive model understanding, in the absence of direct measures.

Any more thoughts? I have understood both myself and your points better as we went along.
