by cgdl
456 days ago
Yes, and that's the problem. What Zhang et al. [2] showed convincingly in the Rethinking paper is that focusing on the hypothesis space alone cannot be enough: the same hypothesis space fits both real and random data, so it is already too large. Therefore, methods that focus on the hypothesis space have to argue for a bias in practice towards a better subspace, and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypotheses over others in the space. But once you are ready to do that, algorithmic stability is enough. You then don't need to think about Bayesian ensembles or other proxies/simplifications, and can focus on just the specific learning setup you have.

BTW algorithmic stability is not a new idea. An early version showed up within a few years of VC theory in the 1970s, in order to understand why nearest neighbors generalizes (it wasn't called algorithmic stability then though). If you are interested in this, I also recommend [3].

[2] https://arxiv.org/abs/1611.03530
[3] https://arxiv.org/abs/1902.04742
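To make the nearest-neighbor stability point concrete, here is a rough sketch, not a formal bound. It swaps one training example and counts how often a 1-NN classifier's predictions change. The 1D data, the 10% label noise, and all function names are assumptions made up for illustration, and the measured quantity is only an empirical proxy for stability.

```python
import random

def make_point(rng):
    # Hypothetical toy data: x in [0, 1], label is a threshold at 0.5
    # with 10% label noise (both choices are assumptions).
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    return (x, y)

def nearest_label(train, x):
    # 1-NN prediction: label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def prediction_change(n, n_test=500, seed=0):
    # Fraction of test inputs whose 1-NN prediction changes when
    # a single training example is replaced by a fresh draw.
    rng = random.Random(seed)
    train = [make_point(rng) for _ in range(n)]
    perturbed = list(train)
    perturbed[rng.randrange(n)] = make_point(rng)
    test_xs = [rng.random() for _ in range(n_test)]
    changed = sum(
        nearest_label(train, x) != nearest_label(perturbed, x)
        for x in test_xs
    )
    return changed / n_test

def mean_change(n, seeds=10):
    # Average the proxy over several independent draws.
    return sum(prediction_change(n, seed=s) for s in range(seeds)) / seeds

if __name__ == "__main__":
    for n in (10, 50, 200):
        print(n, mean_change(n))
```

The swapped point only affects predictions inside its own small neighborhood, so the changed fraction should shrink as the training set grows. That locality is the intuition behind the early stability arguments for nearest neighbors.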
|
"and that already requires studying the specific optimization algorithm in order to understand why it picks certain hypothesis over others in the space." But the OP paper explains how even "guess and check" can generalize similarly to SGD. It's becoming more well understood that the role of the optimizer may have been historically overstated for understanding DL generalization. It seems to be more about loss landscapes.
Don't get me wrong, these references you're linking are super interesting. But they don't take away from the OP paper which is adding something quite valuable to the discussion.
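For anyone unfamiliar with "guess and check", here is a toy sketch of the idea: draw random parameters until one fits the training set perfectly, then see how it does on held-out data. This is not the paper's experimental setup. The 2D linear classifier, the Gaussian data, the ground-truth rule `w_true`, and every function name are assumptions picked to keep it short.

```python
import random

def predict(w, x):
    # Linear classifier through the origin, labels in {-1, +1}.
    return 1 if w[0] * x[0] + w[1] * x[1] > 0 else -1

def sample_data(n, w_true, rng):
    # Gaussian inputs labeled by a hypothetical ground-truth rule.
    data = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        data.append((x, predict(w_true, x)))
    return data

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

def guess_and_check(train, rng, max_tries=100_000):
    # No gradients: sample random weights and keep the first
    # guess that classifies every training point correctly.
    for _ in range(max_tries):
        w = (rng.gauss(0, 1), rng.gauss(0, 1))
        if accuracy(w, train) == 1.0:
            return w
    raise RuntimeError("no perfect fit found within max_tries")

def run(seed=0, n_train=20, n_test=2000):
    rng = random.Random(seed)
    w_true = (1.0, -0.5)  # assumed ground truth, for illustration only
    train = sample_data(n_train, w_true, rng)
    test = sample_data(n_test, w_true, rng)
    w = guess_and_check(train, rng)
    return w, accuracy(w, train), accuracy(w, test)

if __name__ == "__main__":
    w, train_acc, test_acc = run()
    print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

In this toy setting, any weight vector that fits the training set also tends to do well on the test set, even though no optimizer was involved. Whether that carries over to deep networks is exactly what the thread is debating.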