by electricslpnsld 3093 days ago
If you can’t reasonably get at or use second order information, how else are you going to optimize arbitrary objectives?

Well, come to think of it, why don’t DL approaches use BFGS instead of gradient descent?

4 comments

There is literature on Quasi-Newton and Krylov Subspace methods for training Neural Networks. For example, https://dl.acm.org/citation.cfm?id=3104516.

I think the primary reason that such methods are not used much in practice is memory and computational cost: each function evaluation is expensive and you need to solve a very large system at every iteration.
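To put rough numbers on the memory side, here is a back-of-the-envelope sketch. The 10-million-parameter model, float32 storage, and a history of 10 vector pairs for L-BFGS are all assumptions picked for illustration, and both helper functions are hypothetical names, not part of any library.

```python
def dense_hessian_bytes(n_params, bytes_per_float=4):
    """Bytes needed to store a dense n x n Hessian."""
    return n_params * n_params * bytes_per_float


def lbfgs_history_bytes(n_params, history=10, bytes_per_float=4):
    """Bytes for the L-BFGS history: `history` pairs of length-n vectors."""
    return 2 * history * n_params * bytes_per_float


# Assumed model size, for illustration only.
n = 10_000_000

hessian_tb = dense_hessian_bytes(n) / 1e12
history_gb = lbfgs_history_bytes(n) / 1e9

print(f"dense Hessian:   {hessian_tb:.0f} TB")
print(f"L-BFGS history:  {history_gb:.1f} GB")
```

Under these assumptions the dense Hessian is hundreds of terabytes, while the limited-memory history is under a gigabyte, which is why the limited-memory and Krylov variants are the ones that show up for networks at all.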

Also to reply to a sibling comment, you can add momentum and step length adjustments to second-order methods in much the same way as in steepest-descent to help escape saddles. The only difference is how the descent direction is chosen for the optimization.

This is correct - second order methods are great in theory, but they are generally computationally prohibitive for high dimensional problems.
Second-order methods are attracted to saddle points in high-dimensional spaces. The math and practice of optimizing these surfaces have a lot of nuances like this, so much of what you learn in a convex optimization class doesn't apply very well.
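A toy illustration of the saddle-point attraction, assuming the simplest possible saddle f(x, y) = x^2 - y^2 and a pure, undamped Newton step. The function, starting point, and learning rate are assumptions chosen for the sketch; real networks are far messier.

```python
def grad(x, y):
    """Gradient of f(x, y) = x**2 - y**2, which has a saddle at (0, 0)."""
    return (2.0 * x, -2.0 * y)


def newton_step(x, y):
    """One undamped Newton step. The Hessian is diag(2, -2)."""
    gx, gy = grad(x, y)
    # Multiply the gradient by the inverse Hessian, diag(1/2, -1/2).
    return (x - gx / 2.0, y - gy / -2.0)


def gd_step(x, y, lr=0.1):
    """One plain gradient descent step."""
    gx, gy = grad(x, y)
    return (x - lr * gx, y - lr * gy)


start = (1.0, 0.5)

# Newton jumps straight to the stationary point, which here is the saddle.
newton_point = newton_step(*start)

# Gradient descent shrinks x but grows |y|, so it drifts away from the saddle.
gd_point = start
for _ in range(20):
    gd_point = gd_step(*gd_point)

print("Newton after 1 step:", newton_point)
print("GD after 20 steps:  ", gd_point)
```

The Newton step solves for a zero of the gradient and does not care about the sign of the curvature, so it lands on the saddle. Gradient descent follows the negative-curvature direction downhill instead. Damping, trust regions, or using the absolute values of the eigenvalues changes this behavior.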
Do you have any recommendations on sources to read about this? Everything I've read discusses the use of the Hessian to not only determine you are at a saddle point but to also use its eigenvalues to escape.
I have not used it but there is an implementation in PyTorch: http://pytorch.org/docs/master/optim.html#torch.optim.LBFGS
The question is: why do you need to optimize in the first place? Why not look up an answer instead of solving a mathematical optimization problem?