| Why exactly is it a problem that powerful models can memorise training data with random labels, if they generalise well on non-memorisation problems? I also think that a lot of progress is being made, seemingly in parallel, on this "understanding NN [generalisation]" topic, and the different threads don't necessarily seem to be aware of each other. This article fails to mention most of this recent progress. Possibly that is an artifact of the time it takes to produce and publish a completed article, but it paints a picture in which we are more clueless than we actually are. The most promising line of work, IMO, is outlined in these blog posts: https://towardsdatascience.com/neural-networks-are-fundament... and
https://towardsdatascience.com/deep-neural-networks-are-bias... which summarise three research papers. The basic idea is that because NN outputs are usually thresholded to produce a prediction/classification, a change in a parameter does not always change the output. This leaves room for a non-uniform distribution over output functions when weights are initialised randomly. The authors investigate this empirically (with some theoretical justification) and find that the bias is toward "simple" functions, which often generalise well because "simple" properties are mostly what we care about in real data. Generalisation can then be explained by SGD simply being more likely to find parameters that make the NN express a simpler function, out of all the functions that perfectly match the training data. There are also some interesting investigations into infinite-width limits - the best known being the result that NNs are equivalent to Gaussian Processes at initialisation, and then Neural Tangent Kernels [1], which show a way to view them through the lens of kernel methods throughout training as well. But then there is Feature Learning in Infinite-Width Neural Networks [2], which seems to be at odds with the kernel approach (because kernels do not learn features) - this paper proposes a different parametrisation that admits feature learning. So there is progress both in understanding generalisation and in understanding when feature learning does and does not occur. (And this is hardly a comprehensive sampling.) [1]: https://arxiv.org/abs/1806.07572
[2]: https://arxiv.org/abs/2011.14522 |
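To make the "non-uniform distribution over output functions" point concrete, here is a minimal sketch of the kind of experiment described above. Everything in it is an assumption chosen for illustration rather than the setup used in the papers: a one-hidden-layer tanh network, standard Gaussian initialisation, 3 binary inputs, and the names `random_function` and `sample_function_counts`. It samples parameters at random (no training, no SGD), thresholds the output at zero, and counts how often each of the 256 possible boolean functions shows up.

```python
import math
import random
from collections import Counter


def random_function(n_inputs, hidden, rng):
    """Sample one randomly initialised network and return the boolean function
    it computes over all 2**n_inputs binary inputs, encoded as a bit string."""
    # Toy assumption: every weight and bias is drawn from N(0, 1).
    w1 = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(hidden)]
    b2 = rng.gauss(0, 1)

    bits = []
    for x in range(2 ** n_inputs):
        inp = [(x >> i) & 1 for i in range(n_inputs)]
        h = [
            math.tanh(sum(w * v for w, v in zip(row, inp)) + b)
            for row, b in zip(w1, b1)
        ]
        out = sum(w * v for w, v in zip(w2, h)) + b2
        # Thresholding: many different parameter settings give the same bit.
        bits.append("1" if out > 0 else "0")
    return "".join(bits)


def sample_function_counts(n_samples=5000, n_inputs=3, hidden=8, seed=0):
    """Count how often each boolean function appears under random parameters."""
    rng = random.Random(seed)
    return Counter(
        random_function(n_inputs, hidden, rng) for _ in range(n_samples)
    )


counts = sample_function_counts()
uniform_expectation = sum(counts.values()) / 2 ** (2 ** 3)  # 256 functions
print(f"distinct functions seen: {len(counts)} of 256")
print(f"count per function if uniform: {uniform_expectation:.1f}")
for fn, c in counts.most_common(5):
    print(fn, c)
```

If the bias exists in this toy setting, the constant functions (all zeros or all ones) should turn up far more often than a uniform distribution over functions would predict. Note that this only probes the prior at initialisation; it says nothing by itself about what SGD does.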
But for the main black-box network? No such visibility. It is undoubtedly doing all kinds of stupid shit that I have no insight into. And I would /love/ to be able to sort it out and get rid of the dumb parts, because I a) want tiny models that run hella fast on slow-ass phones, and b) would love to eliminate sources of spurious correlations to provide better predictions. Getting rid of the dumb parts means more compute to do the smart things.
The high-level theory answers maaaybe tell me about convergence but not whether the NN converges to anything reasonable. Giant models provide a huge amount of flexibility, which helps the model find answers, but the answers may not be any good, and we really don't have robust ways to know. (e.g., aggregates over big eval datasets may tell me something could be better, but not what or why.)
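A tiny sketch of the aggregate-eval problem, using made-up toy records (the slice names `daylight` and `low_light`, the labels, and the helper `accuracy_by_slice` are all hypothetical, not from any real eval set). It shows how a decent-looking overall number can sit on top of one slice that is failing badly. Slicing tells you roughly where the problem is, but still not why, which is the complaint above.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """records: iterable of (slice_name, y_true, y_pred) tuples.
    Returns a dict mapping each slice name to its accuracy."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, seen]
    for slice_name, y_true, y_pred in records:
        totals[slice_name][0] += int(y_true == y_pred)
        totals[slice_name][1] += 1
    return {name: correct / seen for name, (correct, seen) in totals.items()}


# Made-up illustration data: 95 "daylight" examples, 5 "low_light" examples.
records = (
    [("daylight", 1, 1)] * 90
    + [("daylight", 1, 0)] * 5
    + [("low_light", 1, 1)] * 2
    + [("low_light", 1, 0)] * 3
)

overall = sum(y_true == y_pred for _, y_true, y_pred in records) / len(records)
per_slice = accuracy_by_slice(records)

print(f"overall accuracy: {overall:.2f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.2f}")
```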