
by cauch 4 days ago

> REASON 1

This just means "simpler representations are not enough", not "good representations cannot be complex combinatorial combinations" (complex enough that they are very difficult for a human to see).

> REASON 2

Are you saying that I believe that the only way to get human-like text is by doing a near-infinite one-to-one mapping? This is obviously not the case.

You can do, for example, a GAM time-series forecast. This can have a relatively low number of weights and still return very sensible predictions, yet not capture any real understanding of the phenomenon it predicts. For example, it does not understand causality, just correlation.

> REASON 3

That is like saying "I've built an algorithm that is able to do 10 + 27, but there is an infinite list of numbers, so it is impossible for this algorithm to do 23113454453 + 1233253245". That is not true: you just decompose into (53+45), (44+32), ... and add rules to combine these elements together.
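A minimal sketch of that decomposition, assuming the "learnt" part is a table of two-digit sums and the "rules" are a plain carry. The names `TABLE`, `chunks`, and `add_by_pattern` are hypothetical and chosen only for illustration:

```python
# "Learnt by heart": every sum of two 2-digit chunks (0..99 + 0..99).
TABLE = {(a, b): a + b for a in range(100) for b in range(100)}


def chunks(n):
    """Split a non-negative integer into 2-digit chunks, least significant first."""
    out = []
    while True:
        n, c = divmod(n, 100)
        out.append(c)
        if n == 0:
            return out


def add_by_pattern(x, y):
    """Add two non-negative integers using only table lookups plus a carry rule."""
    xs, ys = chunks(x), chunks(y)
    length = max(len(xs), len(ys))
    xs += [0] * (length - len(xs))
    ys += [0] * (length - len(ys))
    result, carry, scale = 0, 0, 1
    for a, b in zip(xs, ys):
        s = TABLE[(a, b)] + carry          # memorized pattern, then the combination rule
        carry, pair = divmod(s, 100)       # keep two digits, pass the rest along
        result += pair * scale             # place the chunk at its position
        scale *= 100
    return result + carry * scale


print(add_by_pattern(23113454453, 1233253245))
```

The table has 10,000 entries no matter how long the inputs are, which is the point of the argument: a finite set of patterns plus a combination rule covers an infinite input space.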

It is what is happening with AI: there is enough data to get "some pattern" in the language. Just the patterns, not the understanding of the language itself. And this pattern can be reproduced in plenty of different places.

> REASON 4

This argument is contradicted by "basic LLMs" or even simpler models that perform surprisingly well. Less well than SOTA, but if your argument were true, a CNN or ARIMAX could never do better than a coin toss.

> REASON 5

Your example is a good case where the AI will _combine_ patterns learnt from different places. It will pick characteristics of each of your scenarios and mix them together. The result will look realistic, but it is still applying learnt patterns together.

Also, you did not answer my point about human arithmetic, and all your reasons are contradicted by my example there. Humans DO maths partially because they "learnt by heart" some patterns rather than applying an understanding of fundamental arithmetic. If "answering very well based on patterns" was not a good strategy, or required infinite weights, or made it impossible to use these patterns in novel situations, how do you explain that humans can do that themselves? As soon as we admit that humans do "some pattern some times", then we have to admit that there is a continuous spectrum, and that it allows output that looks realistic to be the result of patterns rather than understanding.

By the way, I just saw a new article reaching HN: https://news.ycombinator.com/item?id=48410427 , and it indeed explains similar things, and illustrates that the best way for SOTA to deal with arithmetic is by "not understanding it". And yet, when you use one of those SOTA models, you would be able to invoke each one of your "REASON" points to claim that the model did understand arithmetic.


Nevermark 4 days ago

I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

I just started out with mapping to be systematic. Mapping is ground zero, then interpolation, i.e. any smooth fitting function or basis, then combinatorial, where different bases are recognized and then projected relative to their relevance to a new input.

Each of those increases modeling efficiency and power, but even combinatorial doesn't scale to problems like language.

I may be doing a poor job communicating. A formal breakdown of the scaling issues with lower-order modeling that is scaled up to compensate would make a great paper.

To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

And consider the other side. We have no idea how our own brains are lifting up what is relevant vs. what is not. We are used to it happening. We call it "understanding". But we don't know how it works, how we work. Despite experiencing it.

What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.


cauch 4 days ago

> I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

The way an LLM works is by creating a space of N dimensions, N being the number of tokens. This space contains all the possible combinations. The LLM will find the best combination, but will not scan the whole space. To find the best combination, it will minimize the loss function, which is low when the output corresponds to the target. By doing so, it will not explore the combinations that "go in the wrong direction", and therefore it is not true that growing the space by a factor S makes running the model S times harder.

Because of that, while the combination space scales like combinatorial, the model does not. A model with 2 weights (or rather tokens, but the number of weights should be at least the number of tokens) corresponds to 4 combinations (AA, AB, BA, BB can indeed be described by 2 binary weights of value "A" or "B"). A model with 3 weights corresponds to 27 combinations. A model with 4 weights corresponds to 256 combinations. ... A model with N weights corresponds to N to the power N combinations. The number of combinations increases a lot, and yet the number of weights increases linearly.
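As a quick check of the counting rule stated above (N slots, each holding one of N values, so N to the power N combinations), here is a small sketch. The function name `combinations` is hypothetical, and the rule itself is taken from the comment at face value:

```python
def combinations(n):
    """Number of ways to fill n slots when each slot can take one of n values."""
    return n ** n


# The number of weights grows linearly; the number of combinations does not.
for n in (2, 3, 4, 10):
    print(f"weights={n:>2}  combinations={combinations(n)}")
```

Under this rule the counts for 2, 3, and 4 are 4, 27, and 256, and 10 weights already give ten billion combinations.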

In SOTA, we have billions of weights. That is a model that contains a very very very very big number of combinations, something so big that it is difficult for a human to grasp. It will not try all of these combinations one by one; the gradient descent method will help it find the best combination without having to do so.

So, yes, SOTA are finding "the best combination" amongst an impressively huge number of combinations, yet without having to "scale like combinatorial".
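A toy sketch of "finding the best combination without scanning the space", assuming a simple convex loss (squared distance to a target vector). Real LLM training is stochastic and non-convex, so this only illustrates the cost argument; `loss`, `grad`, and `gradient_descent` are hypothetical names:

```python
def loss(w, target):
    """Squared distance between the weights and the target."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w, target):
    """Gradient of the loss with respect to each weight."""
    return [2 * (wi - ti) for wi, ti in zip(w, target)]


def gradient_descent(target, lr=0.25, steps=50):
    """Follow the gradient from the origin; one gradient evaluation per step."""
    w = [0.0] * len(target)
    for _ in range(steps):
        g = grad(w, target)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


target = [0.1 * i for i in range(20)]      # a 20-dimensional problem
w = gradient_descent(target)

# An exhaustive scan with only 10 candidate values per dimension.
grid_points = 10 ** len(target)

print("final loss below 1e-9:", loss(w, target) < 1e-9)
print("gradient steps used:", 50)
print("grid points an exhaustive scan would visit:", grid_points)
```

Fifty gradient evaluations are enough here, while the exhaustive grid has 10^20 points. The search cost follows the path taken, not the size of the space.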

> To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

Yes. Easy. A SOTA LLM does that. It is modeling without understanding. It does not understand, it finds the best patterns. And when you put it in a new situation, it uses these patterns to create a new text, without truly understanding the content of the text. And if you ask an additional question, it will use the previous text as context, and create a new text that, as it has been trained to, will be consistent with the output that has been given.

Your assertion "you can prove me wrong" is circular reasoning: you start by saying "if a model can produce a text that looks realistic to me, then it means it has understanding. To prove me wrong, give me a text that looks realistic to me and has no understanding". Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.

> If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

The combination space grows as N to the power N. So, a trillion parameters is not "just 1000 times bigger" than a billion parameters, but more than 1000 to the power of one billion bigger (the exact value is often even bigger than that). Do you realise the size of the combination space? That is 1 followed by 3 times one billion zeroes.
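The sizes involved can be checked with logarithms, since the numbers themselves are far too large to build. This is a sketch that takes the N-to-the-power-N rule above as given; `log10_n_pow_n` is a hypothetical helper:

```python
import math


def log10_n_pow_n(n):
    """log10(n ** n), computed as n * log10(n) so the huge number is never built."""
    return n * math.log10(n)


billion, trillion = 10**9, 10**12

small = log10_n_pow_n(billion)     # about 9e9: a number with roughly 9 billion digits
large = log10_n_pow_n(trillion)    # about 1.2e13
claimed_floor = billion * math.log10(1000)   # log10(1000 ** 1e9), about 3e9

print("log10 of billion-weight space: ", small)
print("log10 of trillion-weight space:", large)
print("log10 of the ratio:            ", large - small)
print("ratio exceeds 1000^(1e9):      ", large - small > claimed_floor)
```

The ratio has an exponent of roughly 1.2e13, far above the 3e9 of "1 followed by 3 times one billion zeroes", which matches the remark that the exact value is bigger still.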

> What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.

I think you don't understand how LLMs work: they find the best combinations in an incredibly huge parameter space, but don't need to explore the whole space, just the one-dimensional manifold that is the curve followed by gradient descent within this huge combination space.

There are plenty of clues that SOTA don't "understand". For example, did you notice that SOTA happens to understand what humans understand, and not to understand what humans don't understand? If the way SOTA works were indeed by "discovering the true mechanism", it would discover with equal probability mechanisms that humans have already noticed and mechanisms that humans have not noticed yet. For example, humans know that the Standard Model of particle physics is incomplete, and there are plenty of texts and books about that which the SOTA learnt from. Yet SOTA did not "understand" the underlying mechanism that explains particle physics. It does not really know what an electron is by "making sense of what this object does", it only knows it as "a language word that can be used in some context in a specific way".

And, sure, SOTA is helping with new discoveries, but the way it does so is by using a "reasoning" approach. If SOTA really created its own understanding when learning human language, then it should have the new discovery right after training, without using any "reasoning" approach, because it would be something that it has already understood.


Nevermark 4 days ago

> Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.

Yes, if it consistently produces good output for highly varied stimuli that can be intentionally picked to be unlikely to have ever had an obvious representation in the training set, then yes, it understands.

I think we are talking past each other a bit.

A series of increasingly challenging datasets, used to capture scaling efficiencies, would ground our discussion.

But the level of performance for models is simply too good vs. the number of parameters to be doing anything trivial.

Deep learning models do something combinatorial models do not. The linear tensor + non-linear transforms do two special things:

1. The tensor itself just projects a linear space into higher dimensions, but it's still the same information space. Project a 2D surface into higher dimensions linearly, and there can be more parameters, but it is not more information, since there is an expansion of linear dependence to match.

2a. But then the nonlinearity thresholds, squashes, or otherwise alters the linear results in a way that removes linear dependencies, increasing the useful dimensionality of the representation.

2b. And the squashing also allows dimensions to be folded down.

So by both expanding and flattening representational dimensions, deep learning models are able to model higher-order relationships directly, where any less expressive modeling would require cobbling together many patches of fitting.

Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.
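Points 1 and 2a can be illustrated with a small rank computation: lifting 2D points to 4D with a linear map leaves the rank at 2, and applying a ReLU afterwards raises it. This is a sketch under stated assumptions: the points, the matrix `W`, and the choice of ReLU as the nonlinearity are all made up for illustration, and it does not cover the "folding" of point 2b.

```python
from fractions import Fraction


def rank(rows):
    """Matrix rank via Gaussian elimination, using exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def linear(points, W):
    """Apply the linear map W (one output row per tuple) to every point."""
    return [[sum(w * x for w, x in zip(row, p)) for row in W] for p in points]


def relu(rows):
    """Elementwise threshold at zero."""
    return [[max(0, v) for v in row] for row in rows]


points = [(1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (1, -1), (-1, -1)]
W = [(1, 0), (0, 1), (1, 1), (1, -1)]      # hypothetical 2D -> 4D projection

lifted = linear(points, W)
print("rank of 2D inputs:        ", rank(points))
print("rank after linear lift:   ", rank(lifted))
print("rank after lift plus ReLU:", rank(relu(lifted)))
```

The linear lift adds columns but no independent directions. The threshold breaks the linear dependence between columns, so the same eight points now span all four dimensions.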

So a dramatically greater ability to "understand" is why deep learning models are so much better. They are not doing simple combinatorial fitting.

"Understanding" or not, combinatorial relationships are the low bar for deep learning models; they are inherently great at learning much higher-order relationships.

I am falling asleep at this point. I feel like we need a blackboard and a computer. You are saying a lot of things that make me think, and make sense to me.


cauch 3 days ago

Yes, this conversation is useless.

You keep saying "what I observe with GenAI can only be the result of 'understanding'" without providing any proof at all. Just a few beliefs.

You just say "look at this behavior, that's the proof". I truly don't think it is: nothing proves that this behavior requires 'understanding'. And nothing you provided helps: all you provided are impressive behaviors and then the unsubstantiated conclusions "and this behavior can only be done with understanding".

At the same time, there are too many clues showing that such behavior does not require understanding, even if it _looks_ incredibly clever:

1. GenAI does not understand (after the training phase) things that humans don't understand. If GenAI had the capacity to build an understanding during training, then there is no reason this understanding would coincide with human understanding.

2. Optimisation does not always lead to "understanding". Human brains choose to optimise "learning multiplication table by heart" rather than building a pocket calculator inside the neurons.

3. Human brains, that have "understanding", are working fundamentally differently from GenAI (flow of thoughts, intrinsically intertwined memory and compute, optimised for world-model treatment rather than token treatment, ...). It is an unsubstantiated jump to simply conclude AI has "understanding", while it can be the result of fundamental differences.

4. "Basic" LLMs are surprisingly good at creating convincing sentences, and yet there are situations where it is blatantly clear they did not understand anything. More advanced SOTA models are refinements of "basic LLMs", and therefore the "sentence construction that is done without understanding" is still used, and keeps the SOTA model from building a full understanding.

> Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.

It's exactly what I'm saying: deep learning models are very good at learning complex relationships. Such as "I don't know what 'Paris' is, I don't have any understanding of what a city is in reality, but when the token Paris is associated with these other tokens in this complex order, even if I never saw it before, I have learnt the complex relationships and therefore I'm able to build a series of tokens".

They are very good at learning complex relationships that allow them to choose the correct combination even if they did not "understand" the content of the correct combination.

I understand that it is impressive: those relationships are very complex and very numerous (there are billions of them). It is easier to do anthropomorphism and conclude that the AI has "understood".

But again, the main problem is that you just claim, without any proof, "no, I cannot believe that, I refuse to believe that".

(and, by the way, I personally think that AI (SOTA but also even "basic LLM") do have 'rules' that correspond to some kind of understanding of basic mechanism. I think they have basic "world models". But these world models are optimised "to write text" rather than to "understand the world", and therefore the large majority of AI output is just not-understood token chains)


Nevermark 3 days ago

Apologies. Your pushback (frustration and patience) has helped me crystalize my view, thank you.

1. Define understanding.

My definition isn't vague: "a compact representation enabled because that representation's topology closely matches the topology of the relationships being modeled."

Understanding = Scope and Suitability of Behavior / # Parameters.

Useful property: This definition applies across all scales: Scientists and mathematicians increase our understanding, every time patchworks of relationships get replaced with a simpler underlying insight.

Another useful property: It distinguishes between better understanding and having more facts. Facts improve performance but do not (non-trivially) decrease parameters.

What is your definition? In measurable terms?

2. You keep avoiding a basic aspect of modeling:

Higher compactness is achieved by higher representation correspondence between a model and the modeled.

Yes, lower-level representations can work. Even well, without good "understanding". But not as compactly. And as problem complexity grows, the relative difference in parameter budgets between high-correspondence and low-correspondence representations explodes.

This is not a subtle effect.

The hallmark of lower-level fitting is the far greater number of parameters required.

Dead simple example: Piece-wise linear vs. polynomial fitting of Bezier curves. Accuracy / parameter is far greater for the latter, because the representation matches the relationships being modeled.
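A sketch of that comparison for one coordinate of a cubic Bezier curve: a 4-coefficient polynomial reproduces it up to rounding, while a piecewise-linear interpolant still has visible error with far more parameters. The control values in `p` and all function names are hypothetical, chosen only for illustration:

```python
def bezier(p, t):
    """One coordinate of a cubic Bezier curve, in Bernstein form."""
    p0, p1, p2, p3 = p
    u = 1 - t
    return u**3 * p0 + 3 * u**2 * t * p1 + 3 * u * t**2 * p2 + t**3 * p3


def cubic_coeffs(p):
    """The same curve rewritten as a0 + a1*t + a2*t^2 + a3*t^3 (4 parameters)."""
    p0, p1, p2, p3 = p
    return (p0, 3 * (p1 - p0), 3 * (p0 - 2 * p1 + p2), -p0 + 3 * p1 - 3 * p2 + p3)


def poly(c, t):
    return c[0] + c[1] * t + c[2] * t**2 + c[3] * t**3


def piecewise_linear(knots, t):
    """Linear interpolation between evenly spaced knot values on [0, 1]."""
    n = len(knots) - 1
    i = min(int(t * n), n - 1)
    local = t * n - i
    return knots[i] * (1 - local) + knots[i + 1] * local


def max_error(approx, p, samples=1000):
    """Largest absolute deviation from the Bezier curve on a fine grid."""
    return max(abs(approx(i / samples) - bezier(p, i / samples))
               for i in range(samples + 1))


p = (0.0, 2.0, -1.0, 1.0)                  # hypothetical control values
coeffs = cubic_coeffs(p)
poly_err = max_error(lambda t: poly(coeffs, t), p)
print("polynomial, 4 parameters: max error", poly_err)

pl_errors = {}
for k in (4, 16, 64):                      # k knot values = k parameters
    knots = [bezier(p, i / (k - 1)) for i in range(k)]
    pl_errors[k] = max_error(lambda t: piecewise_linear(knots, t), p)
    print(f"piecewise linear, {k} parameters: max error", pl_errors[k])
```

The piecewise-linear error shrinks as knots are added but stays well above the polynomial's rounding-level error, even at sixteen times the parameter count.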

That is an intentionally trivial example, but the same relationship holds for any problem.

You keep avoiding that.

3. Today's LLM models are very compact compared to humans.

Compressing the substance of a corpus of global human writing into less than 1% of a single human's parameter space is compact.

Humans have 100–200 trillion, some people think 500 trillion, synapses.

How do you argue that behavior scope and suitability / parameters is not remarkable, when it is remarkable compared to any specific human you could point to?

No human can converse reasonably across the scope of global communication. But these models can. For <1% of a human's parameter budget.
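The ratio behind the "about 1%" figure can be worked out directly. This sketch assumes a model of one trillion parameters (the round number used earlier in the thread, not a measured value) and follows the comment's framing of treating one synapse as one parameter:

```python
from fractions import Fraction

MODEL_PARAMS = 10**12                      # assumption: "a trillion parameters"

SYNAPSE_ESTIMATES = {                      # the three estimates quoted above
    "100 trillion": 100 * 10**12,
    "200 trillion": 200 * 10**12,
    "500 trillion": 500 * 10**12,
}

ratios = {label: Fraction(MODEL_PARAMS, count)
          for label, count in SYNAPSE_ESTIMATES.items()}

for label, r in ratios.items():
    print(f"{label} synapses: model uses {float(r):.1%} of that budget")
```

Under these assumptions the ratio is 1% at the low synapse estimate and drops to 0.5% and 0.2% at the higher ones.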

4. Finally, based on your clear definition, how do you argue that humans understand but models do not? Saying we are different is a copout. Defining understanding as us vs. other is both circular and unenlightening. And ignores the real progress models are clearly making relative to humans.


Nevermark 3 days ago

Is that more coherent?


cauch 2 days ago

1. Okay with your definition.

My point is that you can have the same result without a representation that "closely matches the topology of the relationships being modelled". For example, with a representation that "allows relationships between tokens, yet does not care about any meaning or concept that is not useful to form convincing sentences".

And therefore, it means that you can have convincing text without needing a "representation's topology closely matching the topology of the relationships being modelled", and therefore, according to your own definition: no understanding.

2. It is not true I'm avoiding that. I have answered very clearly.

1) GenAI are not trained to get the higher representation of the world, but to get the best convincing sentence generation. This does not require a full world understanding. Worse, once a convincing sentence generation is reached, there is no gain by getting a better world understanding: the training mechanism that pushes into the correct direction stops and therefore it can go into any direction at all.

2) High compactness does not equal best solution. Even humans don't use "high compactness" when doing basic arithmetic, but use a "by heart multiplication table". Being compact is useless if it comes with high complexity each time you need to recompute the output.

3) A very, very good approximation can reach higher compactness anyway. Your Bezier curve is a good example: real physical phenomena are almost never the result of a Bezier equation. A Bezier curve does not understand the phenomenon. When it comes to GenAI, it can "fit" reality very closely with several representations, but the majority of those representations correspond to an incorrect "understanding" of reality.

Another example: if I throw a ball in the air, the motion will be at first order a quadratic equation, plus corrections due to friction, wind, ... If I just "train" something for "throw a ball", this system may fit a quadratic function plus corrections, but it will achieve the same result with Bezier curves, or Fourier series, or additive Gaussians, or ... But the "understanding" is that the ball is influenced by gravity, which leads to a quadratic equation. The system does not understand that. It has no reason to understand that. And it has no reason to prefer a quadratic equation fit over a Bezier fit; on the contrary, the Bezier fit will be more realistic (as the quadratic equation is just the first-order approximation).

If you want to understand a paper plane trajectory, it is a complex system, and you probably need plenty of parameters to describe the gravity, the wind at each position and each time, the shape of the plane at each time, ... But you can describe the trajectory with just a few parameters using a Bezier curve. Train on plenty of paper plane trajectories, and you will have a system that can give you a very realistic paper plane trajectory based on Bezier curves. And yet, your system has no understanding of the paper plane trajectory: it does not know what mechanisms make the paper plane go up or down. It just creates a realistic trajectory without knowing why this trajectory is realistic, just that this trajectory makes sense based on the other trajectories it has seen.
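The thrown-ball case can be made concrete: a quadratic Bezier curve with three control values reproduces the frictionless parabola exactly, even though nothing in those three numbers names gravity. This is a sketch assuming a vertical throw with no friction or wind; `G`, `v0`, `T`, and the function names are illustrative choices:

```python
G = 9.81                                   # m/s^2, standard gravity (assumed)


def parabola(t, v0):
    """Height of a ball thrown straight up, ignoring friction and wind."""
    return v0 * t - 0.5 * G * t * t


def quadratic_bezier(s, p0, p1, p2):
    """Quadratic Bezier in one coordinate, with s in [0, 1]."""
    u = 1 - s
    return u * u * p0 + 2 * u * s * p1 + s * s * p2


def bezier_controls(v0, T):
    """Control values that reproduce the parabola on the time window [0, T]."""
    p0 = parabola(0.0, v0)                 # start height
    p2 = parabola(T, v0)                   # end height
    p1 = p0 + v0 * T / 2                   # chosen to match the initial slope
    return p0, p1, p2


v0, T = 10.0, 2.0                          # hypothetical throw: 10 m/s, 2 s window
controls = bezier_controls(v0, T)

worst = max(abs(parabola(T * i / 100, v0) - quadratic_bezier(i / 100, *controls))
            for i in range(101))

print("control values:", controls)
print("max difference from the parabola:", worst)
```

Both representations give the same trajectory up to rounding. The parabola form exposes `G` as a parameter; the Bezier form stores only three heights, which fits the argument that a good fit need not encode the mechanism.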

3. This argument seems to go against your thesis. You are saying that humans, who "understand" and yet cannot hold as many conversations as an LLM, have far too many neurons. What are these neurons even for, then? You are explaining that LLMs are just "something different", a reduced mini-version of a brain, and yet you are also saying that they are able to do the complex things the brain does.

Another way of seeing it is that LLMs are "dropping" things that they don't need to create convincing sentences, such as "understanding the token". They just "get the Bezier curve fit of the relationship" instead of understanding the real mechanisms and concepts.

It's like your Bezier curve example: a system that just creates a realistic paper plane trajectory based on "typical Bezier curves observed during training" will need way fewer "neurons" than a system that needs to understand the whole aerodynamics of the paper plane.

4. I argue this the same way I say that a system that describes a paper plane trajectory based on best-fit Bezier curves does not understand the mechanism behind how a paper plane trajectory works. I am not saying "I define 'understanding' as what humans do", I am saying that creating convincing sentences does not require understanding, the same way that generating realistic paper plane trajectories does not require understanding gravity, the Navier-Stokes equations and Brownian motion.

The Bezier curve paper plane trajectory predictor system I have mentioned, do you think it has an understanding of gravity? Of Navier-Stokes? Of Brownian motion?

No, it has not. You can open this system. It just has Bezier curves for plenty of examples, and thanks to that, it knows that one trajectory is realistic and another is unrealistic. And at some point, it is also able to give realistic trajectories in brand new situations it has never trained on.
