Hacker News new | ask | show | jobs
by mturmon 3093 days ago
Following a gradient is smarter than trial and error. You can make an argument that, in high-dimensional parameter spaces, it’s hard to do better (because, gradient descent is linear in the number of dimensions).

Ordinary Metropolis-Hastings, for example, is closer to trial and error.

2 comments

> You can make an argument that, in high-dimensional parameter spaces, it’s hard to do better (because, gradient descent is linear in the number of dimensions).

Nitpick: If you're interested in solving optimisation problems, it's very easy to do better than gradient descent. Gradient descent performs very poorly when the directional curvature of the objective function varies too much with the direction. Newton's method, quasi-Newton methods, and nonlinear conjugate gradient are some of the more ingenious, beautiful, and clean ways; there are also some dirty, hacky ways to go.

It is a little bit interesting that fancier optimisation algorithms than gradient descent are unnecessary or unhelpful in some large-scale applications.

You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions. You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.
> You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions.

Why would you say that second-order information is "not generally workable in high dimensions"? We regularly run Newton's method on problems with tens of millions of variables today.

And Newton's method isn't the only way to use second-order information. It is easy to access, for example, Hessian-times-vector information using the same reverse-mode differentiation that's so popular today, using only a constant factor more time.

> You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.

What reason, exactly, is that?

I meant, in general settings (ie, no special problem structure), that you need full Hessians for Newton’s method. And, regarding conjugate gradient, in (non-DL) settings that I’m used to, that for good results you need preconditioners which are also second order.

Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.

> I meant, in general settings (ie, no special problem structure), that you need full Hessians for Newton’s method. And, regarding conjugate gradient, in (non-DL) settings that I’m used to, that for good results you need preconditioners which are also second order.

Yep, but often sparsity is present or the objective function is reasonably well-behaved. The former can save straight Newton; the latter will make nonlinear CG or quasi-Newton methods converge rapidly. (Quasi-Newton methods build a simple model of the Hessian from gradient and step information, usually using repeated low-rank modifications of a diagonal matrix. There are variants, like L-BFGS, that have an explicit, tunable limit on the number of additional vectors to be stored. This work really well for some reason---usually far better than gradient descent, and almost never more than a small constant factor slower)

> Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.

Interior-point methods for linear programming form the Hessian of a barrier function at each iteration, then (sparse) Cholesky factorise it, then do a few backsolves to find a search direction. (Special hacks generally go into these "few backsolves" to speed convergence.) This is a damped Newton method. Commercial implementations include CPLEX, Gurobi, and MOSEK; there are a great many less-commercial implementations as well.

Chih-Jen Lin's LIBLINEAR uses a truncated Newton method (solve Hessian*direction=gradient by linear conjugate gradient, stopping early when sufficient accuracy has been obtained to take a step) to create linear classifiers for data.

> because, gradient descent is linear in the number of dimensions

Yes, but increasing dimensions does not mean the manifold of the problem space increases the same.