by MauranKilom
1437 days ago
The "wedge" part under "3. Mode Connectivity" has at least one obvious component: neural networks tend to be invariant to permuting nodes (together with their connections) within a layer. Simply put, it doesn't matter in what order you number the K nodes of e.g. a fully connected layer, but that alone already means there are K! different weight configurations with exactly the same behavior. Equivalently, the loss landscape is symmetric under certain permutations of its dimensions. This means that, at the very least, there are many global optima (well, unless all permutable weights end up with the same value, which is obviously not the case). The fact that different initializations/early training steps can end up in different but equivalent optima follows directly from this symmetry. But whether all their basins are connected, or whether there are just multiple equivalent but separate basins, is much less clear. The "non-linear" connection stuff does seem to imply that they all lie in some (high-dimensional, non-linear) valley. To be clear, this is just me looking at these results from the "permutation" perspective above, because it leads to a few obvious conclusions. I am not qualified to judge which of these results are more or less profound.
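To make the K! claim concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: a toy two-layer network with tanh activations, random weights, and hypothetical helper names `mlp` and `permute_hidden`. It renumbers the K hidden nodes in every possible order (reordering the rows of the first weight matrix and bias, and the matching columns of the second weight matrix) and counts how many orderings reproduce the original output.

```python
import itertools
import math
import random


def mlp(x, W1, b1, W2, b2):
    """Toy two-layer network: y = W2 * tanh(W1 * x + b1) + b2, using plain lists."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]


def permute_hidden(W1, b1, W2, perm):
    """Renumber hidden nodes: reorder rows of W1 and b1, and columns of W2."""
    W1p = [W1[j] for j in perm]
    b1p = [b1[j] for j in perm]
    W2p = [[row[j] for j in perm] for row in W2]
    return W1p, b1p, W2p


# Hypothetical sizes and random weights, chosen only for the demonstration.
rng = random.Random(0)
n_in, K, n_out = 3, 4, 2
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(K)]
b1 = [rng.gauss(0, 1) for _ in range(K)]
W2 = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]
x = [rng.gauss(0, 1) for _ in range(n_in)]

reference = mlp(x, W1, b1, W2, b2)

# Count the hidden-node orderings that leave the output unchanged
# (up to floating-point rounding from the changed summation order).
n_equivalent = 0
for perm in itertools.permutations(range(K)):
    W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm)
    out = mlp(x, W1p, b1p, W2p, b2)
    if all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out, reference)):
        n_equivalent += 1

print(n_equivalent, "of", math.factorial(K), "orderings give the same output")
```

Note that this only shows that the permuted weight vectors are distinct points with identical behavior. It says nothing about whether the basins around them are connected, which is the open part of the question.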
The different solutions found in different runs likely share a lot of information, but learn some different things on the edges. It would be cool to isolate the difference between two networks...