by aoki 3112 days ago
i suspect that what he's referring to is that he's heuristically minimizing a somewhat arbitrary (loss) function in a million-ish dimensions using the simple variants of gradient descent that work under these conditions. it sounds far too WIBNI ("wouldn't it be nice if") to produce good results reliably (in practice, let alone in theory). the landscape has so many stationary points at which to get stuck; why would you ever get good results?

there's a small cottage industry of papers (like [0]) that try to explain this.

[0] https://arxiv.org/pdf/1412.0233.pdf
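
to make the worry concrete, here's a toy 1-d sketch (not taken from the paper; the function, learning rate, and step count are made up for illustration). plain gradient descent on a non-convex function lands in a different stationary point depending on where it starts, and only one of them is the global minimum:

```python
def loss(x):
    # Toy non-convex "loss" with two local minima (hypothetical example).
    return x**4 - 3 * x**2 + x

def grad(x):
    # Analytic derivative of loss.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    # Plain gradient descent: no momentum, no noise.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

right = gradient_descent(2.0)   # settles in the shallower minimum near x = 1.13
left = gradient_descent(-2.0)   # settles in the deeper minimum near x = -1.30
print(right, loss(right))
print(left, loss(left))
```

the surprise is that in very high dimensions this kind of bad luck seems to matter much less than the 1-d picture suggests.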


I think this recent paper [1] sheds quite a bit of light on this.

[1] https://arxiv.org/abs/1703.00810v3

I really don't think that's the best paper to describe as one that "sheds quite a bit of light on this". It has been somewhat controversial since it came out.

I think https://arxiv.org/abs/1609.04836 is seminal in showing that flat (non-sharp) minima go hand in hand with generalization. The parent's paper is good for showing that gradient descent over non-convex surfaces works fine. And https://arxiv.org/abs/1611.03530 is a landmark for kicking off this whole generalization business: it mainly shows that traditional models of generalization, namely VC dimension and ideas of "capacity", don't make sense for neural nets.