by DoctorOetker
2775 days ago
> Some slightly more technical conditions have to hold in order to have vanilla GD work (since the function is non-differentiable at points)
Which function is non-differentiable at points? If you are referring to my example, it is only non-differentiable at x=0 and x=inf, which are both uninteresting points since they aren't divisors of N. Everywhere else, the f(x) I gave is differentiable and Lipschitz continuous of order infinity.
This is in contrast to your pathological example of f(x) = { 1 (x != 0); 0 (x == 0) }... Of course GD cannot work there, and I wouldn't fault the paper for it...
Don't misunderstand me: the paper is interesting, but the title and certain phrasings are very misleading, IMHO.
Still, I think the approach by others is more interesting: take a fixed underlying NN as "ground truth", train a NN of the same architecture to match it, and look at the absolute error between the two.
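For readers following along, here is a minimal sketch of why GD stalls on the pathological step function mentioned above. It assumes a central finite-difference gradient and a starting point away from 0; the names `f`, `numerical_grad`, and `gradient_descent` are hypothetical and not taken from the paper or the thread.

```python
def f(x):
    """Pathological example: 0 at x == 0, and 1 everywhere else."""
    return 0.0 if x == 0.0 else 1.0


def numerical_grad(func, x, h=1e-6):
    """Central finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def gradient_descent(func, x0, lr=0.1, steps=100):
    """Plain gradient descent with a fixed step size."""
    x = x0
    for _ in range(steps):
        x -= lr * numerical_grad(func, x)
    return x


# Away from 0 the function is flat, so the gradient is zero and x never moves.
x_final = gradient_descent(f, x0=3.0)
print(x_final, f(x_final))
```

The iterate stays at its starting point because the function carries no slope information pointing toward the minimum at 0, which is the sense in which GD "cannot work" here.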
|
Sorry, this was referring to the construction provided in the paper referenced.
I do agree that the title is somewhat misleading: when I first read it (and thought, "this is probably wrong"), I imagined it proved that, given any resnet, you can show convergence to the global optimum via GD, not just that "a resnet of a given size converges to a global optimum, via GD, for a specific training set."
That being said, the paper does not prove (nor claim to prove) general, globally optimal convergence of GD, which is what I think you're saying (given, for example, what you mentioned in the GGP about finding the factorization of a semiprime, and your specific function construction). That is what I was pushing back against a bit. In particular, even in the title, they only claim to prove this for a specific class of problems (i.e. NNs).
> Still I think the approach by others is more interesting: by looking at the absolute error between a fixed underlying NN as "ground truth" and observing the error of the training NN (of same architecture as ground truth NN) trained to match the underlying NN
I'm afraid I haven't seen this approach, but it would be interesting. Do you have references?
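To make the quoted setup concrete, here is a rough sketch of the "ground truth NN" idea as I understand it from the description above: a fixed teacher network generates the targets, and a student of the same architecture is trained by full-batch GD to match it. Everything specific here is an assumption for illustration (one hidden tanh layer, the sizes, the learning rate, the MSE loss, and all variable names); it is not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def forward(X, W1, w2):
    """One-hidden-layer tanh network: returns (predictions, hidden activations)."""
    h = np.tanh(X @ W1)
    return h @ w2, h


d, hidden, n = 2, 4, 200  # assumed sizes, chosen only for illustration
X = rng.normal(size=(n, d))

# Teacher: fixed random weights that define the "ground truth" targets.
W1_t, w2_t = rng.normal(size=(d, hidden)), rng.normal(size=hidden)
y, _ = forward(X, W1_t, w2_t)

# Student: same architecture, independent random initialization.
W1, w2 = rng.normal(size=(d, hidden)), rng.normal(size=hidden)

lr, steps = 0.01, 2000
losses = []
for _ in range(steps):
    pred, h = forward(X, W1, w2)
    err = pred - y
    losses.append(float(np.mean(err ** 2)))  # MSE against the teacher's outputs
    # Backpropagate the MSE loss through the two layers by hand.
    grad_w2 = (2.0 / n) * (h.T @ err)
    grad_h = (2.0 / n) * np.outer(err, w2)
    grad_W1 = X.T @ (grad_h * (1.0 - h ** 2))
    W1 -= lr * grad_W1
    w2 -= lr * grad_w2

print(f"initial MSE: {losses[0]:.4f}, final MSE: {losses[-1]:.4f}")
```

Because the teacher has the same architecture, a zero-error solution is known to exist, so whatever error remains after training reflects the optimization rather than the model's capacity.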