| HN Mirror

Though I have complaints about it, Numerical Optimization by Nocedal and Wright is probably the best reference for modern optimization techniques. My complaint is that it also presents many historical techniques that I would argue should not be used, and it doesn't give clear guidance as to which algorithms are the modern, robust ones. To be sure, arguments can be made for all sorts of algorithms, but I will contend: (unconstrained) trust-region Newton-CG [algorithm 7.2 in Numerical Optimization], (equality) composite-step SQP method [algorithm 15.4.1 in Trust-Region Methods by Conn, Gould, and Toint], (inequality) NITRO interior point algorithm [algorithm 19.4 in Numerical Optimization], (equality and inequality) a combination of the above. There are many implementation nuances with these algorithms and they can be made better than their presentation, but I believe them to be a good starting point for modern, fast algorithms.

As far as switching back and forth between the Newton and gradient descent steps, this is largely done in a class of algorithms called dogleg methods. Essentially, the Newton step is tried against some convergence criterion. If it satisfies this criterion, the step is taken. If not, the step is reduced until it eventually becomes the gradient descent step. I'll contend that truncated-CG (Steihaug-Toint CG) does this, but better. Essentially, it's a modified conjugate gradient algorithm for solving the Newton system that maintains a descent direction. The first Krylov vector this method generates is the gradient descent step, so it eventually reduces to this step if convergence proves difficult.
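To make the truncated-CG idea concrete, here is a minimal sketch in plain Python. The names (`truncated_cg`, `hess_vec`, `delta` for the trust-region radius) and the tolerances are my own choices for illustration, and this is not a faithful transcription of algorithm 7.2. It only shows the main mechanics: start from the steepest-descent direction, run CG on the Newton system, and stop at the trust-region boundary if negative curvature shows up or the step gets too long.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(a, x, y):
    """Return a*x + y for plain Python lists."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def to_boundary(p, d, delta):
    """Return tau >= 0 such that ||p + tau*d|| = delta (p inside the region)."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - delta ** 2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(grad, hess_vec, delta, tol=1e-10, max_iter=50):
    """Approximately minimize grad'p + 0.5 p'Hp subject to ||p|| <= delta.

    hess_vec(v) must return the Hessian-vector product H v as a list.
    """
    p = [0.0] * len(grad)
    r = list(grad)             # residual of H p = -grad at p = 0
    d = [-ri for ri in r]      # first direction is steepest descent
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            break
        Hd = hess_vec(d)
        curv = dot(d, Hd)
        if curv <= 0.0:
            # Negative curvature: follow d out to the trust-region boundary.
            return axpy(to_boundary(p, d, delta), d, p)
        alpha = dot(r, r) / curv
        p_next = axpy(alpha, d, p)
        if math.sqrt(dot(p_next, p_next)) >= delta:
            # Full CG step leaves the region: stop on the boundary instead.
            return axpy(to_boundary(p, d, delta), d, p)
        r_next = axpy(alpha, Hd, r)
        beta = dot(r_next, r_next) / dot(r, r)
        d = axpy(beta, d, [-ri for ri in r_next])
        p, r = p_next, r_next
    return p
```

With a large `delta` and a positive definite Hessian this returns (approximately) the Newton step; with a small `delta` the very first iteration hits the boundary and you get a scaled gradient descent step.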

More broadly, there's a question of whether all of the trouble of using second-order information (Hessians) is worth it away from the optimal solution. I will contend, strongly, yes. I base this on experience, but there are some simple thought experiments as well. For example, say we have the gradient descent direction. How far should we travel in this direction? Certainly, we can conduct a line-search or play with a "learning parameter". If you go this route, please use a line-search because it will provide vastly better convergence guarantees and performance. However, if we have the second derivative, we have a model to determine how far we need to go. Recall, a Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)'dx + 0.5 dx' hess f(x) dx. We can use this to figure out how far to travel in this direction by finding an optimal alpha for the model J(alpha) = f(x + alpha dx) ~= f(x) + alpha grad f(x)'dx + (alpha^2/2) dx' hess f(x) dx. If dx' hess f(x) dx > 0, the model is convex in alpha and we can simply look for when J'(alpha) = grad f(x)'dx + alpha dx' hess f(x) dx = 0, which occurs when alpha = -grad f(x)' dx / (dx' hess f(x) dx). When dx' hess f(x) dx < 0, the model says that we should take a really long step, since it predicts that the slope will be even more negative in this direction the farther we go. Though both methods must be safeguarded (the easiest is to just halve the step if we don't get descent), the point is that the Hessian provides information that the gradient did not and this information is useful. This is only one place where this information can be used; others include the direction calculation itself, which is what truncated-CG does.
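As a rough sketch of that thought experiment in plain Python: pick alpha from the quadratic model when the curvature along dx is positive, take a long step when it is not, and halve until we actually get descent. The function name `step_length`, the cap `alpha_max` on the "really long step", and the halving limit are assumptions of mine, not part of any particular algorithm.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def step_length(f, x, g, dx, hess_vec, alpha_max=10.0, max_halvings=30):
    """Choose alpha along dx from the quadratic model, safeguarded by halving.

    g is grad f(x); hess_vec(v) returns hess f(x) v as a list.
    """
    slope = dot(g, dx)             # grad f(x)' dx, negative for descent
    curv = dot(dx, hess_vec(dx))   # dx' hess f(x) dx
    if curv > 0.0:
        # Model is convex along dx: alpha = -grad f(x)' dx / (dx' hess f(x) dx)
        alpha = min(-slope / curv, alpha_max)
    else:
        # Model is unbounded below along dx: take a long (capped) step.
        alpha = alpha_max
    fx = f(x)
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, dx)]
        if f(trial) < fx:
            return alpha
        alpha *= 0.5               # simplest safeguard: halve until descent
    return 0.0                     # no descent found within the halving budget
```

For a quadratic f the model is exact, so the first alpha is accepted without any halving. The halving only kicks in when the model and the function disagree.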

As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-vector products are enough, which allows the problem to scale to really anything that a gradient descent method can scale to.
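One way to see why the full Hessian is not needed: a Hessian-vector product can be approximated from two gradient evaluations. The sketch below uses a central finite difference of the gradient in plain Python. This is only one option (automatic differentiation or a hand-derived product is usually more accurate), and the name `hess_vec_fd` and the default `eps` are my own choices.

```python
def hess_vec_fd(grad, x, v, eps=1e-5):
    """Approximate hess f(x) v using only gradient evaluations.

    Uses (grad(x + eps*v) - grad(x - eps*v)) / (2*eps); the n-by-n Hessian
    is never formed, so the cost is two extra gradient calls per product.
    """
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus, g_minus = grad(x_plus), grad(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]
```

A closure such as `lambda v: hess_vec_fd(grad, x, v)` can then be passed to any solver that only asks for products with the Hessian.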

As one final comment, the angle observation that you make in the blog post is important. It comes in a different form when proving convergence of methods, which can be seen in Theorem 3.2 within Numerical Optimization, which uses expression 3.12. Essentially, to guarantee convergence, the angle between the gradient descent direction and whatever we choose must be controlled.
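A small sketch of how that angle can be monitored in practice, in plain Python. It computes the cosine of the angle between the gradient descent direction -grad f(x) and a candidate direction p; the function name and the idea of comparing it against a small positive threshold are my own illustration, not something taken from the book.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def descent_cosine(g, p):
    """Cosine of the angle between -g (gradient descent direction) and p."""
    return -dot(g, p) / (math.sqrt(dot(g, g)) * math.sqrt(dot(p, p)))
```

A value near 1 means p is nearly the gradient descent direction, a value near 0 means p is nearly orthogonal to it, and a negative value means p is not a descent direction at all.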