| HN Mirror

> I meant, in general settings (ie, no special problem structure), that you need full Hessians for Newton’s method. And, regarding conjugate gradient, in (non-DL) settings that I’m used to, that for good results you need preconditioners which are also second order.

Yep, but often sparsity is present or the objective function is reasonably well-behaved. The former can save straight Newton; the latter will make nonlinear CG or quasi-Newton methods converge rapidly. (Quasi-Newton methods build a simple model of the Hessian from gradient and step information, usually using repeated low-rank modifications of a diagonal matrix. There are variants, like L-BFGS, that have an explicit, tunable limit on the number of additional vectors to be stored. This work really well for some reason---usually far better than gradient descent, and almost never more than a small constant factor slower)

> Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.

Interior-point methods for linear programming form the Hessian of a barrier function at each iteration, then (sparse) Cholesky factorise it, then do a few backsolves to find a search direction. (Special hacks generally go into these "few backsolves" to speed convergence.) This is a damped Newton method. Commercial implementations include CPLEX, Gurobi, and MOSEK; there are a great many less-commercial implementations as well.

Chih-Jen Lin's LIBLINEAR uses a truncated Newton method (solve Hessian*direction=gradient by linear conjugate gradient, stopping early when sufficient accuracy has been obtained to take a step) to create linear classifiers for data.