| > "I have yet to see a neural network actually learn representations outside the range in which it was trained" Generalization doesn't require learning representations outside of the training set. It requires learning reusable representations that compose in ways that enable solving unseen problems. > "On generalization - its still memorization" Not sure what you mean by this; the statement sounds self-contradictory to me. Generalization requires abstraction / compression. Not sure if that's what you mean by memorization. Overparameterized models are able to generalize (and tend to, when trained appropriately) because there are far more parameterizations that minimize loss by compressing knowledge than there are parameterizations that minimize loss without compression. This is fairly easy to see. Imagine a dataset and model such that the model has barely enough capacity to learn the dataset without compression. The only degrees of freedom would be changes of basis. In contrast, if the model uses compression, the capacity it frees up can vary without changing the loss, which increases the degrees of freedom. The more compression, the more degrees of freedom, and the more parameterizations that minimize the loss. If stochastic gradient descent is roughly as likely to find any given compressed minimum as any given uncompressed minimum, then the fact that there are exponentially more compressed minima than uncompressed minima means it will tend to find a compressed one. Of course, this is only a probabilistic argument, and it doesn't guarantee compression / generalization. In fact, we know there are ways to train a model such that it will not generalize, such as training for many epochs on a small dataset without augmentation. |
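
To put rough numbers on the counting argument, here is a toy sketch. It is not a model of real SGD: it simply assumes every minimum is equally likely to be found and that compressed minima outnumber uncompressed ones by a factor that grows exponentially with the degrees of freedom freed by compression. The function name `prob_compressed` and the `growth` parameter are hypothetical and chosen only for illustration.

```python
def prob_compressed(freed_dof: int, growth: float = 2.0) -> float:
    """Chance of landing in a compressed minimum under the toy counting model.

    Assumptions (illustrative only, not taken from the argument itself):
      - every minimum is equally likely to be found, and
      - compressed minima outnumber uncompressed ones by growth ** freed_dof.
    """
    ratio = growth ** freed_dof  # N_compressed / N_uncompressed
    return ratio / (1.0 + ratio)


# Show how quickly the probability approaches 1 as more degrees of freedom are freed.
for k in (0, 1, 5, 10, 20):
    print(f"freed_dof={k:2d}  P(compressed)={prob_compressed(k):.6f}")
```

Under these assumptions the probability starts at one half with no freed degrees of freedom and approaches 1 quickly, which is the "tends to find a compressed minimum" step. If the equal-likelihood assumption fails, for example in the many-epochs-on-a-small-dataset case, the conclusion no longer follows.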