by cauch 11 days ago
Two problems with that.

Firstly, how do you know that the optimal way to highly compress complex information is to understand it? You think it is obvious because you are very familiar with "understanding" as a way to summarise complex information. But there can be billions of different ways, outside of human imagination, that are as good or even better.

But secondly, LLMs don't find the optimal way, they find a local minimum. Everyone who has worked with NNs knows that they are prone to coming up with spurious patterns, incorrect correlations and bad workarounds to guess the correct answer. You regularly need to nudge the NN with specifically engineered features to keep it from falling into the first local minimum.
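To make the "first local minimum" point concrete, here is a minimal sketch of plain gradient descent on a one-dimensional toy loss. The function, the learning rate and the starting points are all made up for illustration; nothing here comes from a real network.

```python
def loss(x: float) -> float:
    # Toy non-convex loss with two basins: a deeper one on the left (x < 0)
    # and a shallower one on the right (x > 0).
    return x**4 - 3 * x**2 + x


def grad(x: float) -> float:
    # Analytical derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1


def descend(x: float, lr: float = 0.01, steps: int = 2000) -> float:
    # Plain gradient descent: always step downhill from wherever we are.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


left = descend(-2.0)   # starts on the left slope, ends in the deeper basin
right = descend(2.0)   # starts on the right slope, ends in the shallower basin
print(left, loss(left))
print(right, loss(right))
```

Both runs stop where the gradient vanishes, but only the one started on the left reaches the lower loss. The run started on the right has no way of knowing a better basin exists.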

When it comes to LLMs, it is extremely difficult to check whether the LLM has triggered on a misleading pattern that, by chance, links two "tokens" together, or on a real concept that indeed links two "tokens" together. Basic probability implies that there are probably tons of "fake patterns" engraved into the weights during LLM training, "fake patterns" that should not exist if there was any kind of "understanding" of the abstract mechanism that links these tokens.


> Firstly, how do you know that the optimal way to highly compress complex information is to understand it?

What is your non-performance baseline for "Understanding"? We don't have such a measure for humans.

Understanding is the behavioral ability demonstrated by learning to model something complex well. Beyond mappings, associations, interpolations.

Models clearly do. Mix up the most unlikely combination of non-trivial subjects, and they respond sensibly. Those responses are not averages, interpolations of any order, or even combinatorial interactions.

There is a reason those kinds of encodings, mappings, associations, interpolations, statistics / stochastics, all failed miserably for decades. Still fail. It took topological transforms, reminiscent of how we compute (dendrite-soma-axon, tensor-sum-nonlinear), and then they leapt several orders of magnitude ahead of any alternative.

The problem with models composed of relationships of lower order than the phenomena they are trying to model, is they require combinatorially more parameters to model anything complex.

For simple problems, poor models fail gracefully. For complex problems, poor models just fail.

> Models clearly do. Mix up the most unlikely combination of non-trivial subjects, and they respond sensibly. Those responses are not averages, interpolations of any order, or even combinatorial interactions.

How do you even know it is the case?

How do you know the output is not the result of combinatorial interactions?

How do you even know that the "sensible" response to an unlikely combination is not the result of a simple recipe that "makes the response sound sensible"? Either you, yourself, have some expertise on the subject, and therefore the combination probably does exist in the AI training data, or you don't, and you have no idea whether the response is sensible or is the usual smooth talk that anyone could come up with after spending 2 or 3 hours googling the subject and crafting something sensible.

Worse, you are saying that the model "understands", which means that it discovers the underlying mechanism that drives the output. This "understanding" is a set of equations that link different concepts, that explain how one concept affects another concept. So, it is "combinatorial interaction". Not a simple linear one, but guess what, LLMs are designed to introduce non-linearity.

Even when AIs are able to find new solutions to math problems, they get there, as humans do, by using existing basic tools to build more complicated ones.

> It took topological transforms, reminiscent of how we compute (dendrite-soma-axon, tensor-sum-nonlinear), and then they leapt several orders of magnitude ahead of any alternative.

And yet, the LLM elements that are "similar" or "analogous" to how the human brain works are very few. The human brain has thoughts "flowing", while an LLM can only work "by step". The human brain is able to learn from a very small dataset, while LLMs need more data than a human will ever be able to analyse, let alone store. The human brain has "memory" and "context" intrinsically intertwined with how it works, while you can decouple these from the LLM. ...

Finally, there is a real contradiction in saying, on one hand, that AI is mimicking the human brain and that this is why it works well and, on the other hand, that AI will find the lowest minimum and that this minimum is "understanding how the phenomenon works" rather than "repeating by heart what it was told during training".

As a human, when you mentally compute 6 times 7, what do you do? Do you do: "6 follows 5, which follows 4, which follows 3, ... and 7 follows 6, which follows 5, ... so we have (1 + 1 + 1 + 1 + 1 + 1) times (1 + 1 + 1 + 1 + 1 + 1 + 1), which is 1 + 1 + 1 + 1 + ..."? I guess you probably don't, you just recall the most helpful element you know by heart. For example, you remember by heart that 6x7 is 42. Or you remember that 3x7 is 21, and therefore 6x7 is the double, 42. Or you remember that 7x7 is 49, and therefore 6x7 is 42. Or you even have a "feeling" from a mixture of all these (6x7 is somewhere around 40 because 5x7 feels like being around 30 and 7x7 feels like being around 50, and if I think of numbers in the 40s that "feel" like they are from the 7 times table, I remember 42).

Same thing when a human does 324x42: the majority of humans will decompose it into "simpler" multiplications that they remember by heart and then, and only then, combine them. It is a good example of how the brain optimises: by balancing the trade-off between "using memory" and "using understanding". Basic operations use memory, but of course it is inefficient to use memory for all numbers, in which case it will use a combination of both.
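A rough sketch of that memory-plus-rules mix for 324x42, assuming the "memory" is just the single-digit times table and the "understanding" is the schoolbook rule for shifting and adding partial products. The names `TIMES_TABLE` and `multiply` are invented for this illustration, and it only handles non-negative integers.

```python
# "Memory": single-digit products known by heart. The table is filled with
# Python's own multiplication here, but after that it is only ever looked up.
TIMES_TABLE = {(a, b): a * b for a in range(10) for b in range(10)}


def multiply(x: int, y: int) -> int:
    # "Understanding": the schoolbook rule. Every digit pair is a table
    # lookup, shifted to its decimal place, and the partial results are added.
    total = 0
    for i, dx in enumerate(reversed(str(x))):
        for j, dy in enumerate(reversed(str(y))):
            total += TIMES_TABLE[(int(dx), int(dy))] * 10 ** (i + j)
    return total


print(multiply(6, 7))     # a single lookup: 42
print(multiply(324, 42))  # six lookups plus the combining rule: 13608
```

A table with 100 entries covers every product. That is the trade-off: pure memory would need one entry per pair of numbers, pure "following the definition" would need no table but many more steps.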

The way humans do basic math operations is not purely by "understanding" arithmetic, it is by relying on what they remember from their training. At the same time, humans know how arithmetic works, and they will use it when relevant. Yet the human brain prefers to rely on some "learnt by heart" elements. This contradicts your assertion that optimisation will always lead to "understanding" and that the human brain optimises the same way AIs do.

This is only one example with numbers, but of course it works with plenty of other things. This is also exactly why humans get "the wrong idea" about plenty of phenomena, which are then described as "counter-intuitive".

The reason "by heart" is part of a good strategy rather than "purely understanding" is that there is a trade-off between "memory" and "compute", in both the human brain and AI: it is easier (and therefore a stronger attractor during the optimisation of the process of "getting the correct answer") to do the faster operation "retrieve from memory" than to do the slower operation "retrieve the theory from memory, compute the first step, store it in the short term memory, compute the second step, store it in the short term memory, compute the final answer by adding the first step answer and the second step answer".

> How do you know the output is not the result of combinatorial interactions?

(A bit of an essay, but it is a good question!)

REASON 1, How simpler representations fail:

Lesser understandings reveal themselves when given novel combinations of prompts.

Mapping fails immediately because it fails on even trivial differences.

Interpolation fails immediately, because the function isn't smooth and the information it needs to model, human language and thought, combines non-linearly, non-locally and with higher-order relationships.

Combinatorial fails as soon as you create a prompt that involves novel non-linear or higher-order interactions. I.e. new combinations.

REASON 2, Parameter requirements of simpler representations:

For human-resembling sensible chats, mapping requires an example of every case. It would require combining the entire training set, with an optimized index. Essentially a search on the whole body with tricks to return anything sensible for even a slight mismatch.

Interpolation, ..., I don't even know how that could work. Again the whole corpus of training data, with some kind of gradient composition overlaid across it. It is an interesting research idea, but the possible mixing of tokens makes this unreasonable for anything but toy problems.

Combinatorial encodings would have to have parameters operating across all the possible ways to combine relationships. There can be some relationship compression, to a base set of represented concepts, and then a combinatorial explosion of parameters for how to combine them.

I include statistical / stochastic transforms here as continuous combinatorial transforms.

Those could do the job, but more parameters than atoms in the universe might be required, for all possible topic/detail compositions.

REASON 3, Training corpus requirements to learn successful lesser representations.

Obviously the training data, even of all human communication, provides only a fraction of possible exact things that could be said. Not enough data for mapping even if infinite resources for creating a map were available.

Interpolation also suffers, because whatever correlations and smooth compressions of the training data can be made, it is still data that barely touches the kinds of sensible compositions that are possible.

And the same for combinatorial. There just isn't a fraction of an infinitesimal number of examples of combined topics and details, compared to what can be sensibly combined in any new conversation. You can't extract combinatorial compressions that don't exist.

REASON 4, Hiding one representation in another doesn't create opportunities that didn't exist before.

These methods all fail when used directly. The problems are not the kind that pushing the same transforms into a deep learning model solves.

The requirements for astronomically more parameters and training data are not met by embedding those kinds of representations into another model.

SOTA models are not operating with cosmological numbers of parameters, or training data that combinatorially represents concept interactions.

Being a deep learning model doesn't somehow lessen the requirements, needed to successfully perform, if it is learning via those lesser representations.

REASON 5, Test a model:

So let's test whether the model is doing more. If it fails for novel combinations of complex topics, then it might only be doing simpler things.

If it is robust to novel situations, then it cannot be operating by doing simpler things that don't scale.

Ask a model to: Write up a Supreme Court pleading for the rights of whales based on all that is known about them scientifically, recent whale language developments, and any applicable human rights law, given the relevant Supreme Court is in a parallel universe in Zion of the Matrix, being pleaded by Keanu Reeves, the actor not the character, and written in Dr. Seuss prose, except with sentences as long as are needed to carry the real technicalities of a suitable filing. And include the assumptions of a back history of whales which have sequestered themselves into a deep hidden underground ocean, where they have been safe until recent excursions by humans which have harmed them. Be specific creating a real history behind those events, with details that are highly relevant to the motivation, reasoning and requests of the pleading. Avoid words with q where possible.

That isn't mapping. Interpolating. Combinatorial composition. SOTA models will generate a reasonable, even creative response to a completely novel combination of subjects and requirements, with non-linear interactions.

A human would have a hard time doing that, and the model does it nearly instantly with a fraction of the parameters we have.

If that isn't "understanding" in some credible sense, I have no idea what understanding looks like. The model is going way beyond its training data, to the relationships in the data that are relevant to combining novel things. To the point it can apply those relationships in combinations it has never encountered. And it makes a trivial task out of it.

> REASON 1

This just means "simpler representations are not enough", not "good representations cannot be complex combinatorial combinations" (complex enough that it is very difficult for a human to see them).

> REASON 2

Are you saying that I believe that the only way to get human-like text is by doing a near-infinite one-to-one mapping? This is obviously not the case.

You can do, for example, a GAM time-series forecast. This can have a relatively low number of weights and still return very sensible predictions, and yet not capture any real understanding of the phenomenon it predicts. For example, it does not understand causality, just correlation.

> REASON 3

That is like saying "I've built an algorithm that is able to do 10 + 27, but there is an infinite list of numbers, so it is impossible for this algorithm to do 23113454453 + 1233253245". That is not true, you just decompose into (53+45), (44+32), ... and add rules to combine these elements together.
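The decomposition can be sketched as follows, assuming the only thing the algorithm "knows" is how to add two numbers below 100, plus a carry rule. The function name `add_in_chunks` and the chunk width of two digits are choices made for this illustration.

```python
def add_in_chunks(a: int, b: int, width: int = 2) -> int:
    # Add two non-negative integers using only small additions on
    # `width`-digit chunks, taken from the right, plus a carry rule.
    base = 10 ** width
    result, carry, place = 0, 0, 1
    while a or b or carry:
        # Small addition: e.g. 53 + 45 first, then 44 + 32 (+ carry), ...
        chunk_sum = a % base + b % base + carry
        result += (chunk_sum % base) * place  # keep the low digits
        carry = chunk_sum // base             # pass the overflow along
        a //= base
        b //= base
        place *= base
    return result


x, y = 23113454453, 1233253245
print(add_in_chunks(x, y))
print(add_in_chunks(x, y) == x + y)
```

The same finite set of small additions covers inputs of any length, which is the whole point: a bounded set of learnt pieces plus combining rules reaches an unbounded set of cases.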

It is what is happening with AI: there is enough data to get "some pattern" in the language. Just the patterns, not the understanding of the language itself. And this pattern can be reproduced in plenty of different places.

> REASON 4

This argument is contradicted by "basic LLMs" or even simpler models that perform surprisingly well. Less well than SOTA, but if your argument were true, a CNN or ARIMAX could never do better than a coin toss.

> REASON 5

Your example is a good case where the AI will _combine_ patterns learnt from different places. It will pick characteristics of each of your scenarios, and mix them together. The result will look realistic, but it is still applying learnt patterns together.

Also, you did not answer my point about human arithmetic, and all your reasons are contradicted by my example there. Humans DO maths partially because they "learnt by heart" some patterns rather than by applying an understanding of fundamental arithmetic. If "answering very well based on patterns" was not a good strategy, or required infinite weights, or made it impossible to use these patterns in novel situations, how do you explain that humans can do that themselves? As soon as we admit that humans do "some pattern some times", then we have to admit that there is a continuous spectrum, and that it allows realistic-looking output to be the result of patterns rather than understanding.

By the way, I just saw a new article reaching HN: https://news.ycombinator.com/item?id=48410427 , and it is indeed explaining similar things, and illustrates that the best way for SOTA to deal with arithmetic is by "not understanding it". And yet, when you use one of those SOTA models, you could invoke each one of your "REASON"s to claim that the model did understand arithmetic.

I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

I just started out with mapping to be systematic. Mapping is ground zero, then interpolation, i.e. any smooth fitting function or basis, then combinatorial, where different bases are recognized and then projected relative to their relevance to a new input.

Each of those increases modeling efficiency and power, but even combinatorial doesn't scale to problems like language.

I may be doing a poor job communicating. A formal breakdown of the scaling issues with modeling that is lower order, but scaled up to make up for it, would be a great paper.

To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

And consider the other side. We have no idea how our own brains are lifting up what is relevant vs. what is not. We are used to it happening. We call it "understanding". But we don't know how it works, how we work. Despite experiencing it.

What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.

> I am not sure what you mean by complex combinatorial. If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale like combinatorial, not something else.

The way an LLM works is by creating a space of N dimensions, N being the number of tokens. This space contains all the possible combinations. The LLM will find the best combination, but will not scan the whole space. To find the best combination, it will minimize the loss function, which is low when the output corresponds to the target. By doing so, it will not explore the combinations that "go in the wrong direction", and therefore it is not true that increasing the space by a factor S increases the difficulty of running the model by a factor S.

Because of that, while the combination space scales like combinatorial, the model does not. A model with 2 weights (or rather tokens, but the number of weights should be at least the number of tokens) corresponds to 4 combinations (AA, AB, BA, BB can indeed be described by 2 binary weights of value "A" or "B"). A model with 3 weights corresponds to 27 combinations. A model with 4 weights corresponds to 256 combinations. ... A model with N weights corresponds to N to the power N combinations. The number of combinations increases a lot, and yet the number of weights increases linearly.
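As a sanity check on the counting, here is a brute-force enumeration under the N-to-the-power-N rule stated at the end of that paragraph: N slots, each taking one of N symbols. That rule gives 4, 27 and 256 for N = 2, 3, 4. The helper name `count_combinations` is invented for this sketch.

```python
from itertools import product


def count_combinations(n: int) -> int:
    # N slots, each filled with one of N symbols ("A", "B", "C", ...).
    symbols = [chr(ord("A") + i) for i in range(n)]
    # Enumerate every assignment explicitly instead of using the formula.
    return sum(1 for _ in product(symbols, repeat=n))


for n in (2, 3, 4, 5):
    # Brute-force count next to the closed form n**n.
    print(n, count_combinations(n), n ** n)
```

The number of slots grows by one per row while the count explodes, which is the contrast the comment is drawing. Enumeration is already slow for N around 8 or 9; that is exactly why nothing can work by listing the space.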

In SOTA, we have billions of weights. That is a model that contains a very very very very big number of combinations, something so big that it is difficult for a human to grasp. It will not try all of these combinations one by one; the gradient descent method will help it find the best combination without having to do so.

So, yes, SOTA are finding "the best combination" amongst an impressively huge number of combinations, yet without having to "scale like combinatorial".

> To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.

Yes. Easy. A SOTA LLM does that. It is modeling without understanding. It does not understand, it finds the best patterns. And when you put it in a new situation, it uses these patterns to create a new text, without truly understanding the content of the text. And if you ask an additional question, it will use the previous text as context, and create a new text that, as it has been trained to, will be consistent with the output that has been given.

Your assertion "you can prove me wrong" is circular reasoning: you start by saying "if a model can do a text that looks realistic to me, then it means it has understanding. To prove me wrong, give me a text that looks realistic to me and has no understanding". Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.

> If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.

The combination space grows as N to the power N. So, a trillion parameters is not "just 1000 times bigger" than a billion parameters, but more than 1000 to the power of one billion bigger (the exact value is often even bigger than that). Do you realise the size of the combination space? That is 1 followed by 3 times one billion zeroes.
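These numbers are far too large to compute directly, but their base-10 logarithms are easy. A small sketch, assuming the N-to-the-power-N rule above; `log10_space_size` is a name made up for this example.

```python
import math


def log10_space_size(n: float) -> float:
    # log10(n ** n) = n * log10(n): roughly the number of decimal digits
    # needed to write down the size of the combination space.
    return n * math.log10(n)


billion, trillion = 1e9, 1e12

# How many orders of magnitude bigger the trillion-parameter space is.
ratio_exponent = log10_space_size(trillion) - log10_space_size(billion)

# The lower bound quoted above: 1000 ** (one billion) = 10 ** (3 billion).
lower_bound_exponent = billion * math.log10(1000)

print(f"ratio       ~ 10^{ratio_exponent:.4g}")
print(f"lower bound ~ 10^{lower_bound_exponent:.4g}")
```

The ratio's exponent comes out around 1.2e13, against 3e9 for the quoted bound, so "more than 1000 to the power of one billion" holds with a lot of room to spare.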

> What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.

I think you don't understand how LLMs work: they find the best combinations in an incredibly huge parameter space, but don't need to explore the whole space, just the 1-dimensional manifold that is the curve followed by gradient descent within this huge combination space.

There are plenty of clues that SOTA models don't "understand". For example, did you notice that SOTA happens to understand what humans understand, and doesn't understand what humans don't understand? If SOTA really worked by "discovering the true mechanism", it would discover with equal probability mechanisms that humans have already noticed and mechanisms that humans have not noticed yet. For example, humans know that the Standard Model of particle physics is incomplete, and there are plenty of texts and books about that which the SOTA learnt from. Yet SOTA did not "understand" the underlying mechanism that explains particle physics. It does not really know what an electron is by "making sense of what this object does", it only knows it as "a language word that can be used in some context in a specific way".

And, sure, SOTA is helping with new discoveries, but the way it does it is by using a "reasoning" approach. If SOTA indeed creates its own understanding when learning human language, then it should have the new discovery right after the learning, without using any "reasoning" approach, because it would be something that it has already understood.