by JunaidB
2360 days ago
That's a great point and to be honest I could have been a lot tighter with the terminology. Good advice to take on board for next time - thanks! Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over? If you have some resources where I could read up on this it would be much appreciated!
As far as switching back and forth between the Newton and gradient descent steps goes, this is largely done in a class of algorithms called dogleg methods. Essentially, the Newton step is tested against some convergence criteria. If it satisfies these criteria, the algorithm takes the step. If not, the step is reduced until it eventually becomes the gradient descent step. I'll contend that truncated-CG (Steihaug-Toint CG) does this, but better. Essentially, it's a modified conjugate gradient algorithm for solving the Newton system that maintains a descent direction. The first Krylov vector this method generates is the gradient descent direction, so it eventually reduces to this step if convergence proves difficult.
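To make the truncated-CG idea concrete, here is a minimal sketch in plain Python. It assumes a trust-region setting where the step is capped at a radius delta. The names (truncated_cg, hess_vec, delta, tol) are my own for illustration and not from any particular library, and hess_vec is assumed to return the Hessian times a vector.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def to_boundary(p, d, delta):
    """Return p + tau * d with tau >= 0 chosen so the result has norm delta."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - delta ** 2
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return axpy(tau, d, p)

def truncated_cg(grad, hess_vec, delta, tol=1e-10, max_iter=50):
    """Approximately minimize grad'p + 0.5 p'Hp subject to ||p|| <= delta."""
    p = [0.0] * len(grad)
    r = list(grad)             # residual of H p = -grad, starting from p = 0
    d = [-ri for ri in r]      # first direction is the gradient descent direction
    if math.sqrt(dot(r, r)) < tol:
        return p
    for _ in range(max_iter):
        Hd = hess_vec(d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:   # negative curvature: follow d to the boundary
            return to_boundary(p, d, delta)
        alpha = dot(r, r) / curvature
        p_next = axpy(alpha, d, p)
        if math.sqrt(dot(p_next, p_next)) >= delta:
            return to_boundary(p, d, delta)   # step too long: truncate it
        r_next = axpy(alpha, Hd, r)
        if math.sqrt(dot(r_next, r_next)) < tol:
            return p_next      # Newton system solved to tolerance
        beta = dot(r_next, r_next) / dot(r, r)
        d = axpy(beta, d, [-ri for ri in r_next])
        p, r = p_next, r_next
    return p

# Hypothetical quadratic with Hessian diag(2, 1) and gradient (-2, -1).
H = [[2.0, 0.0], [0.0, 1.0]]
g = [-2.0, -1.0]
hv = lambda v: [dot(row, v) for row in H]
print(truncated_cg(g, hv, delta=10.0))  # large radius: the Newton step, roughly [1, 1]
print(truncated_cg(g, hv, delta=0.1))   # small radius: a scaled gradient descent step
```

With a generous radius the loop runs to completion and returns the Newton step. With a tight radius it stops on the first iteration and returns a multiple of the negative gradient, which is the fallback behaviour described above.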
More broadly, there's a question of whether all of the trouble of using second-order information (Hessians) is worth it away from the optimal solution. I will contend, strongly, yes. I base this on experience, but there are some simple thought experiments as well. For example, say we have the gradient descent direction dx. How far should we travel in this direction? Certainly, we can conduct a line-search or play with a "learning parameter". If you go this route, please use a line-search, because it provides vastly better convergence guarantees and performance. However, if we have the second derivative, we have a model that tells us how far to go. Recall, a Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)'dx + 0.5 dx' hess f(x) dx. We can use this to figure out how far to travel along dx by looking for an optimal alpha for J(alpha) = f(x + alpha dx) ~= f(x) + alpha grad f(x)'dx + (alpha^2/2) dx' hess f(x) dx. If dx' hess f(x) dx > 0, the model is convex in alpha and we can simply look for where J'(alpha) = grad f(x)'dx + alpha dx' hess f(x) dx = 0, which occurs when alpha = -grad f(x)' dx / (dx' hess f(x) dx). When dx' hess f(x) dx < 0, the model predicts that the slope along dx becomes even more negative the farther we go, which implies that we should take a really long step. Though both methods must be safeguarded (the easiest is to just halve the step if we don't get descent), the point is that the Hessian provides information that the gradient did not, and this information is useful. This is only one place where this information can be used; others include the direction calculation itself, which is what truncated-CG does.
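As a rough sketch of that step-length calculation, the following plain Python computes alpha from the quadratic model and then applies the halving safeguard. The function names and the test problem f(x) = x0^2 + 10 x1^2 are made up for illustration. When the curvature is not positive, the helper returns None and leaves the choice of a long step to the caller.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def model_step_length(grad, hess_dx, dx):
    """Minimizer of the quadratic model along dx, or None if curvature <= 0."""
    curvature = dot(dx, hess_dx)       # dx' hess f(x) dx
    if curvature <= 0.0:
        return None                    # model is unbounded below along dx
    return -dot(grad, dx) / curvature  # alpha = -grad f(x)'dx / (dx' hess f(x) dx)

def safeguarded_step(f, x, dx, alpha, max_halvings=30):
    """Halve alpha until f decreases, the crude safeguard mentioned above."""
    fx = f(x)
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, dx)]
        if f(trial) < fx:
            return trial, alpha
        alpha *= 0.5
    return x, 0.0                      # no descent found: stay put

# Hypothetical test problem: f(x) = x0^2 + 10 x1^2.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad_f = lambda x: [2.0 * x[0], 20.0 * x[1]]
hess_vec = lambda v: [2.0 * v[0], 20.0 * v[1]]

x = [1.0, 1.0]
g = grad_f(x)
dx = [-gi for gi in g]                 # gradient descent direction
alpha = model_step_length(g, hess_vec(dx), dx)
x_new, alpha_used = safeguarded_step(f, x, dx, alpha)
print(alpha, f(x), f(x_new))
```

Because this f is exactly quadratic, the model step is the exact minimizer along dx and no halving is needed. On a general function the model is only an approximation, so the safeguard matters.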
As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-vector products are enough, which allows the problem to scale to really anything that a gradient descent method can scale to.
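One common way to get a Hessian-vector product without forming the Hessian is to difference the gradient along the vector. This is a sketch under that assumption (automatic differentiation is another route), and the function name and eps value are my own choices rather than anything standard.

```python
def hessian_vector_product(grad_f, x, v, eps=1e-5):
    """Approximate hess f(x) v by a central difference of gradients."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad_f(x_plus), grad_f(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]

# Hypothetical example: f(x) = x0^2 + x0*x1 + 10 x1^2, whose Hessian is [[2, 1], [1, 20]].
grad_f = lambda x: [2.0 * x[0] + x[1], x[0] + 20.0 * x[1]]
hv = hessian_vector_product(grad_f, [0.3, -0.7], [1.0, -1.0])
print(hv)  # approximately [1, -19]
```

Each product costs two gradient evaluations and stores only vectors, which is why the memory footprint stays comparable to gradient descent.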
As one final comment, the angle observation that you make in the blog post is important. It comes in a different form when proving convergence of methods, which can be seen in Theorem 3.2 within Numerical Optimization, which uses expression 3.12. Essentially, to guarantee convergence, the angle between the gradient descent direction and whatever we choose must be controlled.
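For illustration, the angle condition can be checked numerically along these lines. I'm taking the quantity of interest to be the usual cosine between the negative gradient and the chosen step p. The threshold min_cos is an arbitrary value I picked for the sketch and is not something the theorem prescribes.

```python
import math

def descent_cosine(grad, p):
    """cos(theta) between the gradient descent direction -grad and the step p."""
    g_norm = math.sqrt(sum(g * g for g in grad))
    p_norm = math.sqrt(sum(pi * pi for pi in p))
    return -sum(g * pi for g, pi in zip(grad, p)) / (g_norm * p_norm)

def angle_is_controlled(grad, p, min_cos=1e-3):
    """True if p stays bounded away from being orthogonal to -grad."""
    return descent_cosine(grad, p) >= min_cos

grad = [1.0, 0.0]
print(descent_cosine(grad, [-1.0, 0.0]))      # gradient descent step itself
print(descent_cosine(grad, [-1.0, 1.0]))      # 45 degrees off
print(angle_is_controlled(grad, [0.0, 1.0]))  # orthogonal step fails the check
```

A method that falls back to the gradient descent direction whenever this check fails keeps the cosine bounded away from zero, which is the kind of control the convergence argument relies on.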