by woopwoop
1514 days ago
Yeah, I didn't mean to imply that "Why does SGD result in lower training loss than the initial weights?" is an open question. But I don't think even lolcatz would call that a sufficient explanation. After all, if the only criterion is "improves on the initial training loss", you could just try random weights and pick the best one. The non-convexity already makes SGD pretty mysterious, and that is without even getting into generalization performance, which seems to imply that SGD is somehow implicitly regularizing.
The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous, no longer guaranteeing lack of over-fitting?
This is less well understood, although there is recent theoretical work here too.
[1] Lee et al (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...
[2] Allen-Zhu et al (2019). A convergence theory for deep learning via over-parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html
[3] Du et al (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html
[4] Zhang et al (2016). Understanding deep learning requires rethinking generalization. https://arxiv.org/abs/1611.03530
[5] Vapnik (1999). The nature of statistical learning theory.