Hacker News
by brucephillips 3113 days ago
> If you ask folks in nonlinear optimization, they'll tell you that DL is not possible.

I sincerely doubt anyone who knows more than one sentence about deep learning would say that, since deep learning doesn't claim to optimize.

3 comments

i suspect that what he's referring to is that he's heuristically minimizing a somewhat arbitrary (loss) function in a million-ish dimensions using the simple variants of gradient descent that work under these conditions. it sounds far too WIBNI to produce good results reliably (in practice, let alone in theory). the landscape has so many stationary points at which to get stuck; why would you ever get good results?

there's a small cottage industry of papers (like [0]) that try to explain this.

[0] https://arxiv.org/pdf/1412.0233.pdf
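To make the "stuck at a stationary point" worry concrete, here is a minimal sketch in one dimension. The function `x**4 - 3*x**2 + x`, the learning rate, and the start points are all assumptions picked for illustration; nothing here is taken from [0], and a 1-D toy says little about a million-dimensional landscape.

```python
# Toy non-convex "loss" with two local minima (an illustrative assumption).
def loss(x):
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of loss(x).
    return 4 * x**3 - 6 * x + 1


def descend(x0, lr=0.01, steps=2000):
    """Plain gradient descent from x0; returns the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


if __name__ == "__main__":
    for start in (-2.0, 2.0):
        x = descend(start)
        print(f"start={start:+.1f} -> x={x:+.4f}, loss={loss(x):+.4f}")
```

Started at -2.0, descent settles in the lower of the two minima; started at 2.0, it settles in the higher one and stays there. Which minimum you get depends only on where you start, which is the intuition behind expecting plain gradient descent to do badly.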

I think this recent paper [1] sheds quite a bit of light on this.

[1] https://arxiv.org/abs/1703.00810v3

I really don't think that's the best paper to describe as one that "sheds quite a bit of light on this". That paper has been somewhat controversial since it came out.

I think https://arxiv.org/abs/1609.04836 is seminal in showing that flat (non-sharp) minima go along with good generalization. The parent's paper is good for showing that gradient descent over non-convex surfaces works fine. https://arxiv.org/abs/1611.03530 is a landmark for kicking off this whole generalization business (it mainly shows that traditional models of generalization, namely VC dimension and ideas of "capacity", don't make sense for neural nets).
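A rough way to picture the sharp-versus-flat intuition: take two minima with the same loss value and see how much the loss can rise when the parameter is nudged a little. The two quadratics, the radius `eps`, and the `worst_case_increase` helper below are hypothetical, chosen only for illustration; this is a crude proxy, not the sharpness measure defined in any of the papers linked above.

```python
# Two toy 1-D losses, both with minimum value 0 (illustrative assumptions).
def sharp(x):
    return 50.0 * (x + 1.0) ** 2  # narrow basin around x = -1


def flat(x):
    return 0.5 * (x - 1.0) ** 2  # wide basin around x = +1


def worst_case_increase(f, x_min, eps=0.1, samples=21):
    """Largest rise in f over an evenly spaced grid on [x_min - eps, x_min + eps]."""
    base = f(x_min)
    offsets = (eps * (2 * i / (samples - 1) - 1) for i in range(samples))
    return max(f(x_min + d) - base for d in offsets)


if __name__ == "__main__":
    print("sharp basin:", worst_case_increase(sharp, -1.0))
    print("flat basin: ", worst_case_increase(flat, 1.0))
```

If the test loss is a slightly shifted version of the training loss (itself an assumption), the same shift costs far more in the narrow basin than in the wide one.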

You are right. Unfortunately, many people (doubly unfortunately, even in academia; well, many who switched careers from optimization to ML) think that machine learning is just optimization.

Regarding deep NNs, one should be careful what one wishes for, because sometimes wishes come true. Ending up at the global optimum of that thing would likely be the last thing one wants.

The key to deep NNs is to do such a pathetic job of optimizing the loss that the generalization is good. A problem is that there are several different ways of doing a job poorly, and not all of them generalize well. When I have my engineer hat on, I would rather not have lots of indeterminism on my watch if I can avoid it. It makes correctness too dang hard to maintain.

On the other hand, if one has a "with high probability" style result where the probabilities are high enough to be practically relevant, then we have something more workable.
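One way to see why the global optimum of the training loss can be unwelcome is a small sketch with polynomial interpolation standing in for a network (a deliberate swap; the data, the noise values, and all names below are made up for illustration). The true relationship is assumed to be y = x, the training labels carry fixed noise, and the degree-4 interpolant drives the training loss to zero while a plain line does a "worse" job of optimizing.

```python
# Five noisy training points around the assumed true function y = x.
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
NOISE = [0.5, -0.5, 0.5, -0.5, 0.5]  # fixed, hand-picked noise
YS = [x + n for x, n in zip(XS, NOISE)]

# Held-out points, labelled with the noise-free function.
TEST_XS = [0.5, 1.5, 2.5, 3.5]
TEST_YS = list(TEST_XS)


def interpolant(x):
    """Degree-4 Lagrange polynomial through all training points (zero training loss)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(XS, YS)):
        basis = 1.0
        for j, xj in enumerate(XS):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


def fit_line(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


SLOPE, INTERCEPT = fit_line(XS, YS)


def line(x):
    return SLOPE * x + INTERCEPT


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


if __name__ == "__main__":
    for name, model in (("interpolant", interpolant), ("line", line)):
        print(name, "train:", mse(model, XS, YS), "test:", mse(model, TEST_XS, TEST_YS))
```

The interpolant wins on training loss and loses on the held-out points, because fitting the data exactly also means fitting the noise. Whether this carries over to deep nets is exactly what the thread is arguing about.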

I don't understand why you wouldn't want a global optimum. Is this obvious? Does the following paragraph explain it? If so, I don't see the connection.

It happens when practitioners generalize theorems to scenarios that look similar but where the theorems don't apply. The common pattern is misapplying a theorem about infinite sets to the finite case. If you don't know the theorem in question to begin with, there is no way for you to misrepresent it.