by mordymoop
1156 days ago
Even very simple and small neural networks that you can easily train and play with on your laptop readily show that this “outputs are just the average of inputs” conception is just wrong. And it’s not wrong in some tricky philosophical sense, it’s wrong in a very clear mathematical sense, as wrong as 2+2=5.

One example that’s been used for something like 15+ years is training on the MNIST handwritten digits dataset to recognize and then reproduce the appearance of handwritten digits. To do this, the model finds regularities and similarities in the shapes of digits and learns to express the digits as combinations of primitive shapes. The model will be able to produce 9s or 4s that don’t quite look like any other 9 or 4 in the dataset. It will also be able to produce a digit that looks like a weird combination of a 9 and a 2 if you figure out how to decode a value from the point in the latent space that lies between them. It’s simply mathematically naive to call this new 9-2 hybrid an “average” of a 9 and a 2. If you averaged the pixels of a 9 image and a 2 image you would get an ugly nonsense image. The interpolation in the latent space is finding something like a mix between the ideas behind the shape of 9s and the shape of 2s. The model was never shown a 9-2 hybrid during training, but its 9-2 will look a lot like what you would draw if you were asked to draw a 9-2 hybrid.

A big LLM is something like 10 orders of magnitude bigger than your MNIST model, and the interpolations between concepts it can make are obviously more nuanced than interpolations in latent space between 9 and 2.
If you tell it to write about a “hubristic trout” it will have no trouble at all putting those two concepts together, as easily as the MNIST model produced a 9-2 shape, even though it had never seen an example of a “hubristic trout.” It is weird because all of the above is obvious if you’ve played with any NN architecture much, but it seems almost impossible to grasp for a large fraction of people, who will continue to insist that the interpolation in latent space that I just described is what they mean by “averaging”. Perhaps they don’t actually understand how the nonlinearities in the model architecture give rise to the particular mathematical features that make NNs useful and “smart”. Perhaps they see something magical about cognition and don’t realize that we are only ever “interpolating”. I don’t know where the disconnect is.
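To make the pixel-average versus latent-interpolation distinction concrete, here is a minimal sketch in NumPy. It is not a trained MNIST model: `decode` is a hypothetical hand-written nonlinear "decoder" whose one-dimensional latent `z` is simply the column where a vertical stroke is drawn, and `count_strokes` is an illustrative helper. The only point is that averaging two decoded images and decoding the averaged latent give different pictures.

```python
import numpy as np

GRID = np.arange(28)  # pixel column indices of a 28x28 image


def decode(z, width=1.5):
    """Hypothetical toy decoder (an assumption, not a trained network):
    the latent z is the column at which a vertical stroke is drawn."""
    profile = np.exp(-0.5 * ((GRID - z) / width) ** 2)  # nonlinear in z
    return np.tile(profile, (28, 1))  # same profile repeated in every row


z_a, z_b = 6.0, 22.0
img_a, img_b = decode(z_a), decode(z_b)

# "Averaging the outputs": two faint strokes, one at each original position.
pixel_average = 0.5 * (img_a + img_b)

# Interpolating in latent space: one full-strength stroke halfway between.
latent_midpoint = decode(0.5 * (z_a + z_b))


def count_strokes(img, threshold=0.3):
    """Count separate bright runs along the top row of the image."""
    bright = (img[0] > threshold).astype(int)
    rising_edges = np.diff(np.concatenate(([0], bright))) == 1
    return int(np.sum(rising_edges))


print("strokes in pixel average:  ", count_strokes(pixel_average))
print("strokes in latent midpoint:", count_strokes(latent_midpoint))
```

The pixel average keeps both strokes at half brightness, while the decoded latent midpoint is a single stroke that appears in neither input. A real autoencoder's latent space is learned rather than hand-picked, but the same nonlinearity is why its interpolations are not pixel averages.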
You see this across the social sciences, where a lot of fields have had papers coming out every decade or so since the 1980s saying that linear regression models are wrong because they don't take into account concepts such as hierarchy (e.g., students go to different schools), frailty (there are likely unmeasured reasons why some people do the things they do), latent effects (there are likely non-linear processes that are more than the sum of the observations, e.g., traffic flows like a fluid and can have turbulence), auto-correlation, spatial correlation, etc.
In fact, I would argue that a decision-tree-based model (e.g., gradient boosted trees) will always arrive at a better solution for a human system than any linear regression. But at this point I suppose I have digressed from the original point.
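The hierarchy point (students nested in schools) can be illustrated with a small synthetic sketch. Everything here is made up for illustration: the variable names, the number of schools, and the effect sizes are assumptions, and the "grouped" model is plain least squares with one intercept per school, not a full multilevel model and not gradient boosted trees. It only compares in-sample fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): 5 schools, 40 students each.
n_schools, n_per_school = 5, 40
school = np.repeat(np.arange(n_schools), n_per_school)
school_effect = rng.normal(0.0, 3.0, size=n_schools)  # per-school baseline
hours = rng.uniform(0.0, 10.0, size=school.size)      # hours studied
score = 2.0 * hours + school_effect[school] + rng.normal(0.0, 1.0, size=school.size)


def fit_mse(X, y):
    """Ordinary least squares fit; returns the in-sample mean squared error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ coef) ** 2))


# Pooled regression: one intercept for everyone, ignores school membership.
X_pooled = np.column_stack([np.ones_like(hours), hours])

# Grouped regression: one intercept per school (one-hot columns) plus the slope.
X_grouped = np.column_stack([np.eye(n_schools)[school], hours])

mse_pooled = fit_mse(X_pooled, score)
mse_grouped = fit_mse(X_grouped, score)
print(f"pooled MSE:  {mse_pooled:.2f}")
print(f"grouped MSE: {mse_grouped:.2f}")
```

Because the pooled model is a special case of the grouped one, the grouped fit can never be worse in-sample; how much it helps depends on how large the school-level differences are relative to the noise.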