| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ramgorur 2816 days ago

I did not understand the paper very well.

1. It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.

2. The only way to guarantee is to start the gradient descent from different points in the search space or try with different step sizes if the algorithm only starts from the same point in the search space.

3. Also does "achieving zero training loss" mean the network has converged to the global optima? I used to know you will get zero training loss even if you are at a local minima as well.

Please correct me if I am wrong.

4 comments

LolWolf 2816 days ago

1. > It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex.

This is false. See, e.g., [0][1].

2. I'm not really sure what the question is here.

3. If your loss is bounded from below (it is a square norm) by 0 and you achieve 0 loss, this means that 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.

---

[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf

[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.

DoctorOetker 2816 days ago

that reference you can't find right now seems rather pertinent?

I think the OP intended and should have written:

"It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function."

For example consider the function f(x)=sin^2(pi * x)+sin^2(pi * N/x) this function has multiple global minima at the divisors of N, where it is f(x)==0, if x or N/x is non-integer, it is guaranteed to be positive...

I am not taking a stance on if gradient descent does or does not guarantee finding global minima and is thus able to factorize cryptographic grade RSA products of primes, but the claim does appear to imply it.

Edit: the multiply symbols changed some cursive

LolWolf 2815 days ago

> that reference you can't find right now seems rather pertinent?

Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).

Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points), but a (very!) slightly generalized variant for convex functions—sub-gradient descent—works.

> for an arbitrary non-convex function.

Sure, but this is also obvious since it is NP-hard to reach global optima in arbitrary non-convex problems. Additionally, specifically on the case of GD, I can give simple examples that always fail (consider f(x) = 1 everywhere except f(0) = 0. GD always fails whenever the initial point, x_0 ≠ 0, since the gradient is zero everywhere, except at one point. Additionally, picking initializations randomly, we reach the global minimum with probability 0 whenever we have support with non-empty interior).

I'm afraid I disagree that this is what the OP intended, though, and I also disagree that the paper's claim implies what you've said, since they only study a very specific subproblem (e.g. minimization of empirical loss on ResNets).

The relative "ease" of this task vs solving arbitrary NP-hard problems is not difficult to believe, since, given a bunch of training examples, I can always generate a resnet that fits those examples perfectly (i.e. with zero loss) in poly-space in a very dumb way: first, generate a circuit that matches the look-up table of the training samples (which is poly-space in the number of samples and can be done in poly-time), then map that circuit to an NN.

DoctorOetker 2815 days ago

>Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)

which function is non-differentiable at points? if you refer to my example, it is only nondifferentiable at x=0 and x=inf which are both uninteresting points since they arent divisors of N, for all the rest f(x) I gave is differentiable and lipschitz continuous of order infinity

This in contrary to your pathological example of f(x)= { 1 (x!=0); 0 (x==0) ... of course GD can not work there, and I wouldn't fault the paper for it...

don't misunderstand me, the paper is interesting, but the title and certain phrasings are very misleading IMHO

Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

LolWolf 2815 days ago

> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the paper referenced.

I do agree that the title is somewhat misleading, since, when I first read it (and thought, "this is probably wrong"), I imagined that it proved that given any resnet, you can show convergence to the global optimum via GD, not just "a resent of a given size converges to a global optimum, via GD, for a specific training set."

That being said, the paper does not prove (nor claim to prove) general, globally-optimal convergence of GD, which is what I think you're saying (given, for example, what you mentioned about finding the factorization of a semiprime in the GGP and your specific function construction)—which is what I was pushing back against a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).

> Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN

I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

DoctorOetker 2815 days ago

Thanks for your added comments, it's really helpful to see a more candid breakdown of others views of a paper.

>I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?

They are referenced in the paper in the section:

>Another way to attack this problem is to study the dynamics of a specific algorithm for a specific neural network architecture. Our paper also belongs to this category. Many previous works put assumptions on the input distribution and assume the label is generated according to a planted neural network. Based on these assumptions, one can obtain global convergence of gradient descent for some shallow neural networks [Tian, 2017, Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper does not try to recover the underlying neural network. Instead, we focus the empirical loss minimization problem and rigorously prove that randomly initialized gradient descent can achieve zero training loss.

I had this idea independently but never pursued it due to lack of time. It's another reason I like this paper: for referencing this approach, at least if they investigate what I think they do, I still havent had time to read those references, but from the description in this section it appears they investigate nearly if not exactly what I wanted to investigate "some day" :)

nshm 2816 days ago

The title of the paper is really misleading. The comments here are even more misleading. The key is their theorem where they say "with high probably over random initialization". They initialize many times and sometimes it converges. Single initialization can stuck in local minimum of course.

DoctorOetker 2816 days ago

but then there is very little of interest: (assuming enough smoothness and Lipschitz continuity) one expects every global minima to have a convex neighbourhood such that gradient descent starting within the neighbourhood reaches the global minimum. The initialize many times and sometimes it converges is just saying the obvious "there exist initial positions for which GD succeeds in finding a global minima"... is my interpretation correct?

nshm 2816 days ago

The contribution of the paper is the estimation of the convergence speed and number of parameters in neural network, that seems a valuable point.

DoctorOetker 2816 days ago

The paper states for the main result:

>In this section, we show gradient descent with a constant positive step size converges to the global minimum with a linear rate.

This is rather ambiguous: it sounds like it guarantees "it WILL converge to the global minimum, btw at a linear rate" but I suspect they are really saying "IF it converges to the global minimum, THEN it will converge at a linear rate" in a way to purpousefully sound like the first statement.

could you comment on if GD does or does not find the global minimum of the integer factorization cost function in the following comment?

https://news.ycombinator.com/item?id=18439287

nshm 2815 days ago

Another sin they have is that they write "with high probability" in the theorem but they do not strictly define that in the main section of the paper. If you look into appendix you'll see that they guarantee the probability 1 - \delta and delta affects the convergence. It means that if you add many many parameters then you just need a couple attempts to converge well. So "IF it converges" is becoming much better "VERY OFTEN it converges". Sorry, no real passion to read the paper in details, so this is just an intuition from a quick look.

TTPrograms 2816 days ago

1) Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

2) This is "a way" not "The only way".

(If A then B) does not imply (if not A then not B)

ramgorur 2816 days ago

>Just because you can prove convergence for convex functions does not mean you can't prove it for non-convex functions.

I don't understand. How do you prove a gradient descent is guaranteed to escape local minima?

heyitsguay 2816 days ago

Convex functions aren't the only functions with a single local (and global) minimum - consider sqrt(|x|) for a simple 1d example.

ramgorur 2816 days ago

Yes, that's true. But in optimization domain the concept of "convexity" is understood in terms of set, not always from the 2nd derivative of a function. Because you might have search spaces where you are not able to differentiate the objective function at all. In those cases the "convex" means a "convex set".

LolWolf 2816 days ago

Sure: you can define convexity for a function that is equivalent to the 2nd derivative definition in the case that the function is twice-differentiable, using only the definition of the convexity of a set.

Define the epigraph of a function to be the set given by {(x, t) | f(x) ≤ t}. Then, we say f is a convex function iff the epigraph is a convex set.

This is equivalent (exercise for the reader!) to the usual definition that a function f is convex iff f((1-t)x + ty) ≤ (1-t)f(x) + tf(y), for all 0 ≤ t ≤ 1, with x, y in the domain of f.

Note that neither of these two definitions require differentiability (or twice-differentiability), but the definitions are equivalent in this case.[0]

---

[0] For proofs of all of these statements see B&V's Convex Optimization.

TTPrograms 2816 days ago

When talking about "convex optimization" one nearly always means that both the function and the domain are convex.

srean 2816 days ago

No need to define 'convex optimization' here, that follows from the definition of a convex function: F(ax + (1-a)y) <= aF(x) + (1-a) F(y) for 0<=a<=1 and x,y in domain of F. For the inequality to be satisfied ax + (1-a)y has to be in the domain. That mean the domain is convex.

RandyRanderson 2816 days ago

FF NNs of even one hidden layer are universal approximators. That is, they do find the global min. What this doesn't tell you is that it's likely a huge graph and will take a looong time to optimize for even trivial data sets. There's lots of proofs around. That's why SGD is used, and for only a small subset of training points at a time.

Re 2: No.

Re 3: Yes.

simonhughes22 2816 days ago

They are universal approximators, although the term approximator means there's an upper bound on the accuracy of how well they emulate a given function (based on the size of the network). However, just because they are universal approximators doesn't mean that you can automatically infer the optimal number of connections and weight for each of those connections in order to minimize the loss over some dataset (derived from some function). Being able to be a universal approximator does not imply you can automatically learn the best approximation of a given function. It's the difference between being capable of learning something, and having learned it. If that makes sense.

srean 2816 days ago

I dont think you understand what universal approximation means. It means there are parameter settings that would reduce the approximation as much as you want. Its an existential property. It does not mean that those parameters can be found.

Anyway this universal property of neural networks get a lot of airtime and people go gaga over it. Its a complete red herring. Its not the first example of a universal approximation and not the last. There is no scarcity of universal approximators. There was no such scarcity even 100s of years ago. The explanation of the success of DNNs lie elsewhere.