|
> I am not sure what you mean by complex combinatorial.
If we are talking about combinatorial, it's combinatorial. N can be very large, but it is going to scale combinatorially, not like something else. An LLM works by creating a space of N dimensions, N being the number of tokens. This space contains all the possible combinations. The LLM will find the best combination, but will not scan the whole space. To find the best combination, it minimizes the loss function, which is low when the output corresponds to the target. By doing so, it does not explore the combinations that "go in the wrong direction", so it is not true that growing the space by a factor S makes running the model S times harder. Because of that, the combination space scales combinatorially while the model does not.
A model with 2 weights (or rather tokens, but the number of weights should be at least the number of tokens) corresponds to 4 combinations (AA, AB, BA, BB can indeed be described by 2 binary weights of value "A" or "B"). A model with 3 weights corresponds to 27 combinations. A model with 4 weights corresponds to 256 combinations. ... A model with N weights corresponds to N to the power N combinations. The number of combinations increases enormously, and yet the number of weights increases linearly.
In SOTA, we have billions of weights. That is a model that contains a very, very big number of combinations, something so big that it is difficult for a human to grasp. It will not try all of these combinations one by one; gradient descent helps it find the best combination without having to do so. So, yes, SOTA models are finding "the best combination" amongst an impressively huge number of combinations, yet without having to "scale like combinatorial".
> To prove me wrong (as a thought experiment), choose a lower order model, any kind you can imagine that would qualify as modeling without understanding. Demonstrate it can do anything close. That it could possibly scale to the human corpus with just a trillion parameters.
Yes. Easy. A SOTA LLM does that. It is modeling without understanding. It does not understand, it finds the best patterns. And when you put it in a new situation, it uses these patterns to create a new text, without truly understanding the content of the text. And if you ask an additional question, it will use the previous text as context and create a new text that, as it has been trained to, is consistent with the output already given.
Your assertion "you can prove me wrong" is circular reasoning: you start by saying "if a model can produce a text that looks realistic to me, then it has understanding. To prove me wrong, give me a text that looks realistic to me and has no understanding". Well, I cannot do that, because for you, if it looks realistic, it has to have understanding.
> If the number of parameters goes up far too fast, then that can't be the way deep learning solves the problem with a trillion, or a few billion, either.
The combination space grows as N to the power N. So, a trillion parameters is not "just 1000 times bigger" than a billion parameters, but more than 1000 to the power of one billion times bigger (the exact ratio is even bigger than that). Do you realise the size of the combination space? That is 1 followed by 3 times one billion zeroes.
> What we do know, because combinatorial is too resource intensive, is we are not just combinatorial either.
I think you don't understand how LLMs work: they find the best combinations in an incredibly huge parameter space, but don't need to explore the whole space, just the 1-dimensional curve that gradient descent follows within this huge combination space.
There are plenty of clues that SOTA models don't "understand". For example, did you notice that SOTA happens to understand what humans understand, and doesn't understand what humans don't understand?
If SOTA really worked by "discovering the true mechanism", it would discover with equal probability mechanisms that humans have already noticed and mechanisms that humans have not noticed yet. For example, humans know that the Standard Model of particle physics is incomplete, and there are plenty of texts and books about that which SOTA has learnt from. Yet SOTA has not "understood" the underlying mechanism that explains particle physics. It does not really know what an electron is by "making sense of what this object does"; it only knows it as "a word that can be used in some contexts in a specific way". And, sure, SOTA is helping with new discoveries, but it does so by using a "reasoning" approach. If SOTA really created its own understanding when learning human language, then it should have the new discovery right after training, without using any "reasoning" approach, because it would be something it had already understood. |
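To put rough numbers on the comment above, here is a minimal Python sketch. It takes the "N weights, N to the power N combinations" rule at face value (under that rule 3 gives 27 and 4 gives 256), and it uses a made-up quadratic loss with a hypothetical `gradient_descent` helper purely as a stand-in. Nothing here is how a real LLM is trained; it only shows that the number of points gradient descent visits is set by the step count, not by the size of the space.

```python
import math

def log10_of_n_to_the_n(n):
    """log10(n**n): roughly how many decimal digits n**n has."""
    return n * math.log10(n)

# Small cases of the N**N rule, computed exactly.
small_counts = {n: n ** n for n in (2, 3, 4)}
print(small_counts)

# A billion vs. a trillion weights: compare exponents only,
# since the numbers themselves are far too large to print.
gap_log10 = log10_of_n_to_the_n(10 ** 12) - log10_of_n_to_the_n(10 ** 9)
print(f"ratio of space sizes ~ 10^{gap_log10:.4g}")

def gradient_descent(target, steps=50, lr=0.25):
    """Minimize the toy loss sum((w_i - t_i)**2), starting from all zeros."""
    w = [0.0] * len(target)
    visited = 1  # the starting point counts as visited
    for _ in range(steps):
        # The derivative of (w_i - t_i)**2 with respect to w_i is 2 * (w_i - t_i).
        w = [wi - lr * 2 * (wi - ti) for wi, ti in zip(w, target)]
        visited += 1
    return w, visited

# Hypothetical 4-weight target, chosen only for illustration.
target = [1.0, -2.0, 3.0, 0.5]
weights, visited = gradient_descent(target)
print(weights, visited)
```

The 50 steps visit 51 points no matter how many combinations the space holds. The toy loss is convex, which real training losses are not, so this says nothing about whether the point reached is "the best combination".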
Yes, if it consistently produces good output for highly varied stimuli, intentionally picked to be unlikely to have ever had obvious representation in the training set, then yes, it understands.
I think we are talking past each other a bit.
A series of increasingly challenging datasets, used to capture scaling efficiencies, would ground our discussion.
But the models' level of performance is simply too good, relative to their number of parameters, for them to be doing anything trivial.
Deep learning models do something combinatorial models do not. The linear tensor + non-linear transforms do two special things:
1. The tensor itself just projects a linear space into higher dimensions, but it's still the same information space. Project a 2D surface into higher dimensions linearly, and there can be more parameters, but there is no more information, since the linear dependence expands to match.
2a. But then the nonlinearity thresholds, squashes, or otherwise alters the linear results in a way that removes linear dependencies, increasing the useful dimensionality of the representation.
2b. And the squashing also allows dimensions to be folded down.
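Points 1 and 2a can be checked numerically on a toy case. The sketch below is only an illustration: the five 2D points, the four weight vectors, and the helper names are made up, and ReLU is assumed as one example of "the nonlinearity". It measures dimensionality as matrix rank, computed exactly with fractions.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def linear(points, weights):
    """Project each point onto every weight vector (no bias term)."""
    return [[sum(p * w for p, w in zip(point, weight)) for weight in weights]
            for point in points]

def relu(rows):
    """Elementwise threshold at zero."""
    return [[max(0, v) for v in row] for row in rows]

# Hypothetical data: five points on a 2D surface, lifted into 4 dimensions.
points = [(1, 0), (0, 1), (0, -1), (2, -1), (-1, -1)]
weights = [(1, 0), (0, 1), (1, 1), (1, -1)]

hidden = linear(points, weights)
print(matrix_rank(hidden))        # 2: four columns, still a 2D space
print(matrix_rank(relu(hidden)))  # 4: thresholding removed the linear dependencies
```

The linear lift adds columns but not rank, while the thresholded version spans all four dimensions. This shows only the rank effect on one hand-picked example; it does not demonstrate the folding in 2b or anything about learned representations.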
So by both expanding and flattening representational dimensions, deep learning models are able to model higher-order relationships directly, where any less expressive model would have to cobble together many patches of fitting.
Another way to put this is that deep learning models are able to learn higher-order relationships directly, not by memorizing and interpolating across learned points or regions.
So a dramatically greater ability to "understand" is why deep learning models are so much better. They are not doing simple combinatorial fitting.
"Understanding" or not, combinatorial relationships are the low bar for deep learning models; they are inherently great at learning much higher-order relationships.
I am falling asleep at this point. I feel like we need a blackboard and a computer. You are saying a lot of things that make me think, and make sense to me.