| > How do you know the output is not the result of combinatorial interactions? (A bit of an essay, but it is a good question!) REASON 1, How simpler representations fail: Lesser understandings reveal themselves to novel combinations of prompts. Mapping fails immediately because it fails on even trivial differences. Interpolation fails immediately, because the function isn't smooth and the information it needs to model, human language and thought, combines non-linearly, non-locally and with higher-order relationships. Combinatorial fails as soon as you create a prompt that involves novel non-linear or higher-order interactions. I.e. new combinations. REASON 2, Parameter requirements of simpler representations: For human-resembling sensible chats, mapping requires an example of every case. It would require combining the entire training set, with an optimized index. Essentially a search on the whole body with tricks to return anything sensible for even a slight mismatch. Interpolation, ..., I don't even know how that could work. Again the whole corpus of training data, with some kind of gradient composition overlayed across it. It is an interesting research idea, but the possible mixing of tokens makes this unreasonable for anything but toy problems. Combinatorial encodings, would have to have parameters operating across all the possible ways to combine relationships. There can be some relationship compression, to a base set of represented concepts, and then a combinatorial explosion of parameters for how to combine them. I include statistical / stochastic transforms here as continuous combinatorial transforms. Those could do the job, but more parameters than atoms in the universe might be required, for all possible topic/detail compositions. REASON 3, Training corpus requirements to learn successful lesser representations. Obviously the training data, even of all human communication, provides only a fraction of possible exact things that could be said. 
Not enough data for mapping even if infinite resources for creating a map were available. Interpolation also suffers, because whatever correlations and smooth compressions of the training data can be made, it is still data that barely touches the kinds of sensible compositions that are possible. And the same for combinatorial. There just isn't a fraction of an infinitesimal number of examples of combined topics and details, compared to what can be sensibly combined in any new conversation. You can't extract combinatorial compressions that don't exist. REASON 4, Hiding one representation in another doesn't create opportunities that didn't exist before. These methods all fail when used directly. The problems are not the kind that pushing the same transforms into a deep learning model solves. The requirements for astronomically more parameters and training data are not met by embedding those kinds of representations into another model. SOTA models are not operating with cosmological numbers of parameters, or training data that combinatorially represents concept interactions. Being a deep learning model doesn't somehow lessen the requirements, needed to successfully perform, if it is learning via those lesser representations. REASON 5, Test a model: So let's test whether the model is doing more. If it fails for novel combinations of complex topics, then it might only be doing simpler things. If it is robust to novel situations, then it cannot be operating by doing simpler things that don't scale. Ask a model to: Write up a Supreme Court pleading for the rights of whales based on all that is known about them scientifically, recent whale language developments, and any applicable human rights law, given the relevant Supreme Court is in a parallel universe in Zion of the Matrix, being pleaded by Keanu Reeves, the actor not the character, and written in Dr. Seuss prose, except with as long of sentences as are needed to carry the real technicalities of a suitable filing. 
And include the assumptions of a back history of whales which have sequestered themselves into a deep hidden underground ocean, where they have been safe until recent excursions by humans which have harmed them. Be specific creating a real history behind those events, with details that are highly relevant to the motivation, reasoning and requests of the pleading. Avoids words with q where possible. That isn't mapping. Interpolating. Combinatorial composition. SOTA models will generate a reasonable, even creative response to a completely novel combination of subjects and requirements, with non-linear interactions. A human would have a hard time doing that, and the model does it nearly instantly with a fraction of the parameters we have. If that isn't "understanding" in some credible sense, I have no idea what understanding looks like. The model is going way beyond its training data, to the relationships in the data that are relevant to combining novel things. To the point it can apply those relationships in combinations it has never encountered. And its makes a trivial task out of it. |
This just means "simpler representations are not enough", not "good representations cannot be complex combinatorial combinations" (complex enough that they are very difficult for a human to see).
> REASON 2
Are you saying that I believe that the only way to get human-like text is by doing a near-infinite one-to-one mapping? This is obviously not the case.
You can do, for example, a GAM time-series forecast. It can have a relatively low number of weights and still return very sensible predictions, and yet not capture any real understanding of the phenomenon it predicts. For example, it does not understand causality, just correlation.
> REASON 3
That is like saying "I've built an algorithm that is able to do 10 + 27, but there is an infinite list of numbers, so it is impossible for this algorithm to do 23113454453 + 1233253245". That is not true: you just decompose into (53+45), (44+32), ... and add rules to combine these elements together.
That is what is happening with AI: there is enough data to get "some pattern" in the language. Just the patterns, not the understanding of the language itself. And these patterns can be reproduced in plenty of different places.
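To make the decomposition concrete, here is a minimal sketch of adding two large numbers using only a finite "memorized" table of two-digit sums plus a single carry rule. This is my own illustration, not a claim about how any model works internally; the names `MEMORIZED` and `chunked_add` are hypothetical, and it assumes non-negative integers.

```python
# Hypothetical "memorized" patterns: the sum of every pair of two-digit chunks.
# 10,000 entries, finite, and learnable by rote.
BASE = 100
MEMORIZED = {(a, b): a + b for a in range(BASE) for b in range(BASE)}


def chunked_add(x: int, y: int) -> int:
    """Add two non-negative integers two digits at a time.

    Each chunk sum is looked up in MEMORIZED; the only extra rule is
    carrying the overflow into the next chunk.
    """
    result = 0
    carry = 0
    place = 1  # weight of the current chunk (1, 100, 10000, ...)
    while x or y or carry:
        a, b = x % BASE, y % BASE          # lowest two digits of each number
        total = MEMORIZED[(a, b)] + carry  # rote lookup plus the carry rule
        result += (total % BASE) * place
        carry = total // BASE
        x //= BASE
        y //= BASE
        place *= BASE
    return result


print(chunked_add(23113454453, 1233253245))
```

The table never grows, yet the procedure handles operands of any length, including ones that were never "seen" when the table was built.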
> REASON 4
This argument is contradicted by "basic LLMs" or even simpler models that perform surprisingly well. Less well than SOTA, but if your argument were true, a CNN or ARIMAX could never do better than a coin toss.
> REASON 5
Your example is a good case where the AI will _combine_ patterns learnt from different places. It will pick characteristics of each of your scenarios and mix them together. The result will look realistic, but it is still applying learnt patterns together.
Also, you did not answer my point about human arithmetic, and all your reasons are contradicted by my example there. Humans DO maths partially because they "learnt by heart" some patterns rather than applying an understanding of fundamental arithmetic. If "answering very well based on patterns" were not a good strategy, or required infinite weights, or made it impossible to use these patterns in novel situations, how do you explain that humans can do that themselves? As soon as we admit that humans do "some pattern, sometimes", then we have to admit that there is a continuous spectrum, and that output that looks realistic can be the result of patterns rather than understanding.
By the way, I just saw a new article reaching HN: https://news.ycombinator.com/item?id=48410427 , and it is indeed explaining similar things, and illustrates that the best way for SOTA models to deal with arithmetic is by "not understanding it". And yet, when you use one of those SOTA models, you would be able to use each one of your "REASON"s to claim that the model did understand arithmetic.