by gwern
1734 days ago
You're looking for flat minima / wide basins. (Amusingly, this one actually does go back to Schmidhuber etc.) This explains a lot of phenomena, like the poorer generalization of second-order optimizers, SGD sometimes working surprisingly better, stochastic weight averaging / EMA, grokking, or patient teachers.
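
A toy sketch of what "flat" buys you, not anything from the papers themselves: take two 1-D quadratic losses that both hit zero at w = 0 but differ in curvature, jitter the weight with Gaussian noise, and compare the average loss. The function name `perturbed_loss`, the curvatures (1 and 100), and the noise scale are all made up for illustration.

```python
import random

def perturbed_loss(curvature, sigma=0.1, n=10_000, seed=0):
    """Mean of L(w) = 0.5 * curvature * w**2 when the minimizer w = 0
    is jittered by Gaussian noise with standard deviation sigma.
    (Hypothetical helper; a quadratic bowl stands in for a real loss basin.)"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        w = rng.gauss(0.0, sigma)          # perturbed weight near the minimum
        total += 0.5 * curvature * w * w   # loss at the perturbed weight
    return total / n

flat = perturbed_loss(curvature=1.0)     # wide basin
sharp = perturbed_loss(curvature=100.0)  # narrow basin, same minimum value

print(f"flat:  {flat:.5f}")
print(f"sharp: {sharp:.5f}")
```

Both minima have identical loss at the optimum, but the same jitter costs roughly 100x more in the sharp one (in expectation the penalty is 0.5 * curvature * sigma^2). If you read the jitter as a crude stand-in for the shift between train and test loss surfaces, that is the usual intuition for why the wide basin is expected to generalize better.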