by yobbo
1437 days ago
The loss function is defined on the parameter space, and saying that "wide basins" generalize better is equivalent to saying that regularizing (in whatever way) gives better generalization, since regularization constrains the parameter and/or function space in that way. In small (two or three) dimensions, there are ways of visualizing overtraining/regularization/generalization with scatter plots (maybe coloured by output label) of the activations in each layer. Training will form tighter "modes" in the activations, and the "low density" space between modes constitutes "undefined input space" for subsequent layers. Overtraining is when real data falls in these "dead" regions. The aim of regularization is to shape the activation distributions so that unseen data falls somewhere with non-zero density. Training loss does not give any information on generalization here unless it shows you're in a narrow "well". The loss landscapes are high-dimensional and non-obvious to reason about, even in tiny examples.
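To make the "dead region" picture concrete, here is a rough sketch, not taken from any real network. The 2-D "activations" are simulated as two tight Gaussian modes, and the names `make_mode`, `in_dead_region` and the `radius` threshold are all made up for illustration. A point counts as being in a dead region if no training activation lies within `radius` of it.

```python
import math
import random

def make_mode(center, spread, n, rng):
    """Simulate one tight activation "mode" as a 2-D Gaussian blob (assumption)."""
    return [(rng.gauss(center[0], spread), rng.gauss(center[1], spread))
            for _ in range(n)]

def nearest_distance(point, cloud):
    """Euclidean distance from point to the closest point in cloud."""
    return min(math.dist(point, q) for q in cloud)

def in_dead_region(point, cloud, radius=0.5):
    """True if no training activation lies within radius of point.

    radius is an arbitrary, hypothetical density threshold.
    """
    return nearest_distance(point, cloud) > radius

rng = random.Random(0)
# Stand-in for the activations of one layer after training: two tight modes.
train_acts = (make_mode((-2.0, 0.0), 0.2, 200, rng)
              + make_mode((2.0, 0.0), 0.2, 200, rng))

near_mode = (-2.0, 0.1)      # an unseen point that lands inside a mode
between_modes = (0.0, 0.0)   # an unseen point that lands between the modes

print("near mode     -> dead region?", in_dead_region(near_mode, train_acts))
print("between modes -> dead region?", in_dead_region(between_modes, train_acts))
```

With real activations you would collect them from a forward pass and probably use a proper density estimate instead of a nearest-neighbour radius; the point is only that "does unseen data land where there was training density" is something you can measure layer by layer.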
With the randomly labelled dataset, these activation "modes" are essentially gerrymandered to fit the data, since the datapoints have no common features correlated with the labels that would cause the network to do otherwise.
With the meaningfully labelled dataset and a smooth loss landscape, multiple datapoints with common features and labels push these activation modes in the same direction, creating "high density modes" within which meaningful generalization occurs.
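A toy way to see the difference, with a deliberately swapped-in model: a 1-nearest-neighbour classifier stands in for the network as a pure memorizer. That is an assumption made to keep the sketch short; it is not what the experiments being discussed used. The data, `make_points` and the cluster layout are also hypothetical. Both labelings are fit perfectly on the training set, but only the meaningful one carries over to held-out points.

```python
import math
import random

def make_points(n, rng):
    """Two well-separated 2-D clusters; returns (points, cluster_ids)."""
    points, cluster_ids = [], []
    for _ in range(n):
        c = rng.randrange(2)
        cx = -2.0 if c == 0 else 2.0
        points.append((rng.gauss(cx, 0.5), rng.gauss(0.0, 0.5)))
        cluster_ids.append(c)
    return points, cluster_ids

def predict_1nn(x, train_x, train_y):
    """Label of the closest training point (a pure memorizer)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    return train_y[best]

def accuracy(xs, ys, train_x, train_y):
    hits = sum(predict_1nn(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
train_x, train_clusters = make_points(200, rng)
test_x, test_clusters = make_points(200, rng)

results = {}
for name in ("meaningful", "random"):
    if name == "meaningful":
        # Labels follow a shared feature (which cluster the point is in).
        train_y, test_y = train_clusters, test_clusters
    else:
        # Labels are coin flips, unrelated to any feature of the point.
        train_y = [rng.randrange(2) for _ in train_x]
        test_y = [rng.randrange(2) for _ in test_x]
    results[name] = (accuracy(train_x, train_y, train_x, train_y),
                     accuracy(test_x, test_y, train_x, train_y))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:>10}: train acc {train_acc:.2f}, test acc {test_acc:.2f}")
```

Training accuracy is the same in both cases, which is the sense in which training loss alone says nothing about generalization here.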
Generalization, or lack of it, is of course also intimately related to adversarial attacks. It seems that what is going on there is that these high density modes are only disconnected from each other (by areas of low density) when considering the degrees of freedom of data on the training set manifold. In the unconstrained input space off the natural data manifold, the high density areas corresponding to different generalizations are likely to be connected, and it's easy to select an "unnatural feature" that will push a datapoint from mapping to one mode to mapping to another.
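A minimal sketch of that off-manifold effect, under heavy assumptions: the "model" is a hypothetical two-weight linear classifier, and the natural data is assumed to lie on the line x2 = 0, so training never constrains the second weight. The value 25.0 is picked arbitrarily to illustrate this. Along the data manifold the point has to move a long way to change class; along the unconstrained direction a tiny nudge is enough.

```python
def predict(weights, x):
    """Hypothetical linear classifier: class 1 if w . x > 0, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0

# Assumption: natural data lies on x2 = 0, so any value of the second
# weight fits the training set equally well. 25.0 is an arbitrary choice.
weights = (1.0, 25.0)

x = (1.0, 0.0)  # a natural datapoint, classified as 1

# Distance to the decision boundary when moving only along the manifold (x1).
on_manifold_shift = abs(x[0])
# Distance to the boundary when moving only along the off-manifold axis (x2).
off_manifold_shift = abs(weights[0] * x[0] / weights[1])

# Nudge the point just past the boundary along the "unnatural" feature.
x_adv = (x[0], -(off_manifold_shift + 1e-3))

print("original class:    ", predict(weights, x))
print("adversarial class: ", predict(weights, x_adv))
print("on-manifold shift needed: ", on_manifold_shift)
print("off-manifold shift needed:", off_manifold_shift)
```

In a real network the unconstrained directions are not axis-aligned and have to be found (for example from gradients), but the geometry being claimed is the same.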
I've suggested this explanation of generalization a number of times over the years, and always had negative feedback from folk who think there's more to the "generalization mystery" than this.