|
> Models clearly do. Mix up the most unlikely combination of non-trivial subjects, and they respond sensibly. Those are not averaged, interpolated by any order, or even combinatorial interactions.
How do you even know that is the case? How do you know the output is not the result of combinatorial interactions? How do you even know that the "sensible" response to an unlikely combination is not the result of a simple recipe that "makes the response sound sensible"? Either you yourself have some expertise on the subject, in which case the combination probably does exist in the AI training data, or you don't, and you have no idea whether the response is sensible or is the usual smooth talk that anyone could come up with by spending 2 or 3 hours googling the subject and crafting something sensible.
Worse, you are saying that the model "understands", which means that it discovers the underlying mechanism that drives the output. This "understanding" is a set of equations that link different concepts and explain how one concept affects another. So it is a "combinatorial interaction". Not a simple linear one, but guess what, LLMs are designed to introduce non-linearity. Even when AIs are able to find new solutions to math problems, the result comes, as when done by humans, from using existing basic tools to build more complicated ones.
> It took topological transforms, reminiscent of how we compute (dendrite-soma-axon, tensor-sum-nonlinear), and then they leapt several orders of magnitude ahead of any alternative.
And yet, the LLM elements that are "similar" or "analogous" to how the human brain works are very few. The human brain has thoughts "flowing", while an LLM can only work "step by step". The human brain is able to learn from a very small dataset, while LLMs need more data than a human will ever be able to analyse, let alone store. The human brain has "memory" and "context" intrinsically intertwined with how it works, while you can decouple these from the LLM. ... 
Finally, there is a real contradiction in you saying, on the one hand, that AI mimics the human brain and that this is why it works well, and on the other hand that AI will find the lowest minimum and that this minimum is "understanding how the phenomenon works" rather than "repeating by heart what it was told during training". As a human, when you mentally compute 6 times 7, what do you do?
Do you do: "6 follows 5, which follows 4, which follows 3, ... and 7 follows 6, which follows 5, ... so we have (1 + 1 + 1 + 1 + 1 + 1) times (1 + 1 + 1 + 1 + 1 + 1 + 1), which is 1 + 1 + 1 + 1 + ..."?
I guess you probably don't; you just recall the most helpful thing you know by heart. For example, you remember by heart that 6x7 is 42. Or you remember that 3x7 is 21, and therefore 6x7 is double that, 42. Or you remember that 7x7 is 49, and therefore 6x7 is 49 minus 7, 42. Or you even have a "feeling" from a mixture of all these (6x7 is somewhere around 40 because 5x7 feels like being around 30 and 7x7 feels like being around 50, and if I think of numbers in the 40s that "feel" like they are from the 7 times table, I remember 42). Same thing when a human does 324x42: the majority of humans will decompose it into "simpler" multiplications that they remember by heart, and only then will they combine them. It is a good example of how the brain optimises, by balancing the trade-off between "using memory" and "using understanding": basic operations use memory, but of course it is inefficient to use memory for all numbers, in which case it will use a combination of both. The way humans do basic math operations is not purely by "understanding" arithmetic; it is by relying on what they remember from their training. At the same time, humans know how arithmetic works, and they will use it when relevant. Yet the human brain prefers to rely on some "learnt by heart" elements. This contradicts your assertion that optimisation will always lead to "understanding" and that human brains optimise the same way AIs do. This is only one example with numbers, but of course it works with plenty of other things. This is also exactly why humans get "the wrong idea" about plenty of phenomena, which are then described as "counter-intuitive".
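To make the 324x42 point concrete, here is a rough Python sketch of that mix of memory and procedure. It is only an illustration, not a claim about how brains or LLMs actually implement it: `TIMES_TABLE`, `digits` and `multiply` are hypothetical names, and the assumption is that single-digit products are the only thing "learnt by heart".

```python
# Hypothetical illustration: single-digit products are memorised ("by heart"),
# everything larger is decomposed into those products and then recombined.
TIMES_TABLE = {(a, b): a * b for a in range(10) for b in range(10)}


def digits(n):
    """Return the decimal digits of a non-negative integer, least significant first."""
    return [int(d) for d in reversed(str(n))]


def multiply(x, y):
    """Multiply via table lookups plus place-value shifts and additions.

    Returns (product, number_of_table_lookups).
    """
    total = 0
    lookups = 0
    for i, a in enumerate(digits(x)):
        for j, b in enumerate(digits(y)):
            # One "retrieve from memory" step, shifted to the right place value.
            total += TIMES_TABLE[(a, b)] * 10 ** (i + j)
            lookups += 1
    return total, lookups


print(multiply(6, 7))     # one lookup is enough
print(multiply(324, 42))  # six lookups, then recombination
```

6x7 needs a single retrieval, while 324x42 needs six retrievals plus the "understanding" of how to shift and add them, which is the memory/compute trade-off described above.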
The reason "by heart" is part of a good strategy, rather than "purely understanding", is that there is a trade-off between "memory" and "compute" in both the human brain and AI: it is easier (and therefore a stronger attractor when optimising the process of "getting the correct answer") to do the faster operation "retrieve from memory" than to do the slower operation "retrieve the theory from memory, compute the first step, store it in short-term memory, compute the second step, store it in short-term memory, compute the final answer by adding the first step's answer and the second step's answer". |
(A bit of an essay, but it is a good question!)
REASON 1, How simpler representations fail:
Lesser forms of understanding reveal themselves when given novel combinations in prompts.
Mapping fails immediately, because it breaks on even trivial differences from what it has stored.
Interpolation fails immediately, because the function isn't smooth and the information it needs to model, human language and thought, combines non-linearly, non-locally and with higher-order relationships.
Combinatorial composition fails as soon as you create a prompt that involves novel non-linear or higher-order interactions, i.e. new combinations.
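As a toy illustration of the mapping case only (a sketch, with a hypothetical `MEMORISED` table and `mapping_model` function, not a description of any real system), an exact-match lookup already misses on a one-character rewording:

```python
# Hypothetical toy "model" that only memorises exact prompt -> response pairs.
MEMORISED = {
    "what is the capital of france?": "Paris",
}


def mapping_model(prompt):
    """Return the stored response for an exact match, or None if the prompt is unseen."""
    return MEMORISED.get(prompt)


print(mapping_model("what is the capital of france?"))  # stored prompt: found
print(mapping_model("what's the capital of france?"))   # trivial rewording: not found
```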
REASON 2, Parameter requirements of simpler representations:
For human-resembling sensible chats, mapping requires an example of every case. It would require storing the entire training set, with an optimized index. Essentially a search over the whole corpus, with tricks to return anything sensible for even a slight mismatch.
Interpolation, ..., I don't even know how that could work. Again the whole corpus of training data, with some kind of gradient composition overlaid across it. It is an interesting research idea, but the possible mixing of tokens makes this unreasonable for anything but toy problems.
Combinatorial encodings would have to have parameters operating across all the possible ways to combine relationships. There can be some relationship compression, down to a base set of represented concepts, and then a combinatorial explosion of parameters for how to combine them.
I include statistical / stochastic transforms here as continuous combinatorial transforms.
Those could do the job, but more parameters than atoms in the universe might be required, for all possible topic/detail compositions.
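A back-of-the-envelope sketch of that explosion, under loudly stated assumptions: one parameter per unordered combination of base concepts, a hypothetical base set of 50,000 concepts, and the commonly quoted rough figure of 10^80 atoms in the observable universe. `interaction_terms` and `smallest_order_exceeding` are names made up for this example.

```python
import math

# Rough, commonly quoted order-of-magnitude estimate (an assumption, not a measurement).
ATOMS_IN_UNIVERSE = 10 ** 80


def interaction_terms(n_concepts, max_order):
    """Count unordered combinations of 1..max_order concepts out of n_concepts.

    Assumes (hypothetically) one parameter per combination.
    """
    return sum(math.comb(n_concepts, k) for k in range(1, max_order + 1))


def smallest_order_exceeding(n_concepts, bound):
    """Return the smallest interaction order whose cumulative term count exceeds bound."""
    total = 0
    for k in range(1, n_concepts + 1):
        total += math.comb(n_concepts, k)
        if total > bound:
            return k
    return None  # even all 2**n - 1 combinations stay under the bound


# With a hypothetical base set of 50,000 concepts:
print(smallest_order_exceeding(50_000, ATOMS_IN_UNIVERSE))
```

The exact crossover depends entirely on the assumed base set, but the count grows roughly like n^k / k!, so a few dozen interacting details are enough to pass any physically plausible parameter budget.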
REASON 3, Training corpus requirements to learn successful lesser representations:
Obviously the training data, even if it were all human communication, provides only a fraction of the possible exact things that could be said. Not enough data for mapping, even if infinite resources for creating a map were available.
Interpolation also suffers, because whatever correlations and smooth compressions of the training data can be made, it is still data that barely touches the kinds of sensible compositions that are possible.
And the same goes for combinatorial composition. The available examples of combined topics and details are an infinitesimal fraction of what can be sensibly combined in any new conversation. You can't extract combinatorial compressions that don't exist.
REASON 4, Hiding one representation in another doesn't create opportunities that didn't exist before.
These methods all fail when used directly. The problems are not the kind that pushing the same transforms into a deep learning model solves.
The requirements for astronomically more parameters and training data are not met by embedding those kinds of representations into another model.
SOTA models are not operating with cosmological numbers of parameters, or training data that combinatorially represents concept interactions.
Being a deep learning model doesn't somehow lessen the requirements needed to perform successfully, if it is learning via those lesser representations.
REASON 5, Test a model:
So let's test whether the model is doing more. If it fails for novel combinations of complex topics, then it might only be doing simpler things.
If it is robust to novel situations, then it cannot be operating by doing simpler things that don't scale.
Ask a model to: Write up a Supreme Court pleading for the rights of whales based on all that is known about them scientifically, recent whale language developments, and any applicable human rights law, given the relevant Supreme Court is in a parallel universe in Zion of the Matrix, being pleaded by Keanu Reeves, the actor not the character, and written in Dr. Seuss prose, except with sentences as long as are needed to carry the real technicalities of a suitable filing. And include the assumptions of a back history of whales which have sequestered themselves into a deep hidden underground ocean, where they have been safe until recent excursions by humans which have harmed them. Be specific, creating a real history behind those events, with details that are highly relevant to the motivation, reasoning and requests of the pleading. Avoid words with q where possible.
That isn't mapping, interpolation, or combinatorial composition. SOTA models will generate a reasonable, even creative response to a completely novel combination of subjects and requirements, with non-linear interactions.
A human would have a hard time doing that, and the model does it nearly instantly with a fraction of the parameters we have.
If that isn't "understanding" in some credible sense, I have no idea what understanding looks like. The model is going way beyond its training data, to the relationships in the data that are relevant to combining novel things. To the point that it can apply those relationships in combinations it has never encountered. And it makes a trivial task out of it.