Hacker News
by gwern 1734 days ago
You're looking for flat minima / wide basins. (Amusingly, this one actually does go back to Schmidhuber etc.) It explains a lot of phenomena, like the poorer generalization of second-order optimizers, SGD sometimes working surprisingly better, stochastic weight averaging / EMA, grokking, or patient teachers.
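A toy way to see what "flat" vs. "sharp" means, as a rough sketch rather than anything from the literature: take two made-up quadratic losses with the same minimum but different curvature, jiggle the weights with Gaussian noise, and measure how much the loss goes up on average. The `sharpness` helper, the two loss functions, and the noise scale are all assumptions chosen for illustration.

```python
import numpy as np

def sharpness(loss, w_star, radius=0.1, n_samples=1000, seed=0):
    """Average loss increase when w_star is perturbed by Gaussian noise.

    Hypothetical helper: `radius` is the noise standard deviation.
    A small value means the minimum is flat (wide basin).
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=radius, size=(n_samples,) + np.shape(w_star))
    base = loss(w_star)
    return float(np.mean([loss(w_star + eps) - base for eps in noise]))

# Two toy losses, both minimized at w = 0 with loss 0,
# but with 100x different curvature.
def sharp_loss(w):
    return 50.0 * np.sum(w ** 2)

def flat_loss(w):
    return 0.5 * np.sum(w ** 2)

w_star = np.zeros(2)
sharp = sharpness(sharp_loss, w_star)
flat = sharpness(flat_loss, w_star)

# Same training loss at the minimum, very different sensitivity to
# weight noise (a crude stand-in for a train/test shift).
print(f"sharp basin: {sharp:.4f}, flat basin: {flat:.4f}")
```

Both minima look identical if you only check the loss at `w_star`; the difference only shows up once the weights are perturbed, which is the intuition behind why averaging or noisy optimizers that drift toward wide basins can help.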