| I did not understand the paper very well. 1. It's theoretically impossible to guarantee a convergence to global optima using gradient descent if the function is non-convex. 2. The only way to guarantee is to start the gradient descent from different points in the search space or try with different step sizes if the algorithm only starts from the same point in the search space. 3. Also does "achieving zero training loss" mean the network has converged to the global optima? I used to know you will get zero training loss even if you are at a local minima as well. Please correct me if I am wrong. |
This is false. See, e.g., [0][1].
2. I'm not really sure what the question is here.
3. If your loss is bounded from below (it is a square norm) by 0 and you achieve 0 loss, this means that 0 is a global optimum, since, by definition, no other objective value can be smaller than this number.
---
[0] Theorem A.2 in Udell's Generalized Low-Rank models paper https://arxiv.org/pdf/1410.0342.pdf
[1] B&V Convex Optimization (https://web.stanford.edu/~boyd/cvxbook/), Appendix B.1. In fact, I can't find the reference right now, but you can easily prove that GD with an appropriate step-size converges to a global optimum on this problem when initialized at (0,0), even though the problem is non-convex.