
This has always been my intuitive explanation as well, and seems related to how a typical over-parameterized net can memorize a randomly labelled training set, yet the same net will be able to generalize (opposite of memorize) if trained on a meaningfully labelled training set (i.e. one where the labels are non-random, and correspond to features in the training data).
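
For anyone who wants to poke at the memorize-vs-generalize contrast, here is a rough sketch. It is not a real deep net: it assumes a fixed random ReLU hidden layer with a minimum-norm least-squares readout as a cheap over-parameterized stand-in, and a toy two-blob dataset. All names (`make_blobs`, `relu_features`, `fit_and_score`) and sizes are made up for illustration.

```python
import numpy as np

def make_blobs(n, rng):
    """Two Gaussian blobs in 2D; the label (+1/-1) says which blob a point is in."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = rng.normal(size=(n, 2))
    x[:, 0] += 3.0 * y  # blob centres at (-3, 0) and (+3, 0)
    return x, y

def relu_features(x, w, b):
    """Fixed random hidden layer: relu(x @ w + b). Only the readout is fitted."""
    return np.maximum(x @ w + b, 0.0)

def fit_and_score(x_tr, y_tr, x_te, y_te, w, b):
    """Fit a min-norm least-squares readout; return (train_acc, test_acc)."""
    phi_tr = relu_features(x_tr, w, b)
    # With far more features than datapoints, lstsq returns the minimum-norm
    # solution that interpolates the training labels.
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    train_acc = np.mean(np.sign(phi_tr @ coef) == y_tr)
    test_acc = np.mean(np.sign(relu_features(x_te, w, b) @ coef) == y_te)
    return train_acc, test_acc

rng = np.random.default_rng(0)
x_tr, y_tr = make_blobs(200, rng)
x_te, y_te = make_blobs(2000, rng)

# 2000 hidden units for 200 training points: heavily over-parameterized.
w = rng.normal(size=(2, 2000))
b = rng.normal(scale=3.0, size=2000)

# Same inputs, but labels drawn independently of the features.
y_random = rng.choice([-1.0, 1.0], size=y_tr.shape)

true_train, true_test = fit_and_score(x_tr, y_tr, x_te, y_te, w, b)
rand_train, rand_test = fit_and_score(x_tr, y_random, x_te, y_te, w, b)

print(f"meaningful labels: train={true_train:.2f} test={true_test:.2f}")
print(f"random labels:     train={rand_train:.2f} test={rand_test:.2f}")
```

The same model should fit both training sets, but only the meaningfully labelled run should score well on held-out points; the randomly labelled run should sit near chance.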

With the randomly labelled dataset, these activation "modes" are essentially gerrymandered to fit the data, since the datapoints have no common features correlated with the labels to cause the net to do otherwise.

With the meaningfully labelled dataset, and a smooth loss landscape, multiple datapoints with common features & labels will be pushing these activation modes in the same direction, creating "high density modes" within which meaningful generalization occurs.
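
The "pushing in the same direction" part can be sketched with per-example gradients. This is only a toy illustration under my own assumptions: a linear logistic model at zero weights (not a deep net), the same kind of two-blob data, and hypothetical helper names. It compares the size of the averaged gradient when labels follow the features against when they are random.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two blobs in 2D; y_true tracks the blob, y_rand ignores the features.
y_true = rng.choice([-1.0, 1.0], size=n)
x = rng.normal(size=(n, 2))
x[:, 0] += 3.0 * y_true
y_rand = rng.choice([-1.0, 1.0], size=n)

def per_example_grads(x, y, w):
    """Gradient of log(1 + exp(-y * (x @ w))) w.r.t. w, one row per datapoint."""
    margin = y * (x @ w)
    return (-y / (1.0 + np.exp(margin)))[:, None] * x

def mean_update_norm(x, y, w):
    """Norm of the averaged gradient: large if datapoints agree, small if they cancel."""
    return np.linalg.norm(per_example_grads(x, y, w).mean(axis=0))

w0 = np.zeros(2)  # start of training
aligned = mean_update_norm(x, y_true, w0)
cancelling = mean_update_norm(x, y_rand, w0)

print(f"mean gradient norm, meaningful labels: {aligned:.3f}")
print(f"mean gradient norm, random labels:     {cancelling:.3f}")
```

With meaningful labels the individual gradients mostly agree, so their average stays large; with random labels they largely cancel, and fitting has to proceed datapoint by datapoint.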

Generalization, or the lack of it, is of course also intimately related to adversarial attacks. It seems that what is going on there is that these high density modes are only disconnected from each other (by areas of low density) along the degrees of freedom that data has on the training set manifold. In the unconstrained input space, off the natural data manifold, high density areas that generalize to different labels are likely to be connected, and it's easy to select an "unnatural feature" that will push a datapoint from one mode into another.

I've suggested this explanation of generalization a number of times over the years, and have always had negative feedback from folk who think there's more to the "generalization mystery" than this.