|
I think there may be some confusion as to the terminology in the article. The gradient does not necessarily lead to the greatest decrease in the function. Specifically, if our function of interest is f and we're currently at the point x, then f(x - alpha grad f(x)) for some alpha may or may not be less that f(x + beta dx) for a different dx and beta. For example, consider the quadratic f(x) = 0.5 x' A x - b' x where ' denotes transpose and we have implicit multiplication. Let A = [2 1;1 2], b = [3; 4], x = [5;6]. This is convex and quadratic with a global minima of [2/3;5/3], inv(A) b. Now, grad f(x) = A x - b = [13;13] and moving in the direction [-13;-13] will not get us there in a single step. However, dx_newton = -inv(hess f(x)) grad f(x) = [-4 1/3, -4 1/3], which brings us to the global minima in a single step. The value in gradient descent is that, combined with an appropriate globalization technique such as a trust region or a line search, it guarantees convergence to a local minima. Newton's method does not unless close enough to the minima. As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly. Meaning, the gradient descent direction rarely leads to the greatest decrease. Unless, of course, we want to make an argument in an infinitesimal sense, which fine, but I'd denote that explicitly. |
Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over?
If you have some resources where I could read up on this it would be much appreciated!