|
|
|
|
|
by DoctorOetker
2780 days ago
|
|
that reference you can't find right now seems rather pertinent? I think the OP intended and should have written: "It's theoretically impossible to guarantee a convergence to global optima using gradient descent for an arbitrary non-convex function." For example consider the function f(x)=sin^2(pi * x)+sin^2(pi * N/x) this function has multiple global minima at the divisors of N, where it is f(x)==0, if x or N/x is non-integer, it is guaranteed to be positive... I am not taking a stance on if gradient descent does or does not guarantee finding global minima and is thus able to factorize cryptographic grade RSA products of primes, but the claim does appear to imply it. Edit: the multiply symbols changed some cursive |
|
Here it is: https://arxiv.org/pdf/1707.08706.pdf (This isn't quite the one I was thinking of, so I'll dig a little deeper, but it covers the idea).
Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points), but a (very!) slightly generalized variant for convex functions—sub-gradient descent—works.
> for an arbitrary non-convex function.
Sure, but this is also obvious since it is NP-hard to reach global optima in arbitrary non-convex problems. Additionally, specifically on the case of GD, I can give simple examples that always fail (consider f(x) = 1 everywhere except f(0) = 0. GD always fails whenever the initial point, x_0 ≠ 0, since the gradient is zero everywhere, except at one point. Additionally, picking initializations randomly, we reach the global minimum with probability 0 whenever we have support with non-empty interior).
I'm afraid I disagree that this is what the OP intended, though, and I also disagree that the paper's claim implies what you've said, since they only study a very specific subproblem (e.g. minimization of empirical loss on ResNets).
The relative "ease" of this task vs solving arbitrary NP-hard problems is not difficult to believe, since, given a bunch of training examples, I can always generate a resnet that fits those examples perfectly (i.e. with zero loss) in poly-space in a very dumb way: first, generate a circuit that matches the look-up table of the training samples (which is poly-space in the number of samples and can be done in poly-time), then map that circuit to an NN.