| HN Mirror


	by electricslpnsld 3093 days ago
	If you can’t reasonably get at or use second-order information, how else are you going to optimize arbitrary objectives? Well, come to think of it, why don’t DL approaches use BFGS instead of gradient descent?

4 comments

fwilliams 3093 days ago

There is literature on Quasi-Newton and Krylov Subspace methods for training Neural Networks. For example, https://dl.acm.org/citation.cfm?id=3104516.

I think the primary reason that such methods are not used much in practice is memory and computational cost: each function evaluation is expensive and you need to solve a very large system at every iteration.
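
To put rough numbers on the memory argument, here is a back-of-the-envelope sketch. The parameter count (10 million), float32 storage, and the L-BFGS history length of 10 are illustrative assumptions, not figures from this thread.

```python
def dense_hessian_bytes(n_params, bytes_per_value=4):
    # A full Hessian has one entry per pair of parameters.
    return n_params * n_params * bytes_per_value

def lbfgs_history_bytes(n_params, history=10, bytes_per_value=4):
    # L-BFGS keeps `history` pairs of vectors
    # (parameter differences and gradient differences).
    return 2 * history * n_params * bytes_per_value

n_params = 10_000_000  # hypothetical model size
hessian_tb = dense_hessian_bytes(n_params) / 1e12
lbfgs_gb = lbfgs_history_bytes(n_params) / 1e9
print(f"dense Hessian: {hessian_tb:.0f} TB, L-BFGS history: {lbfgs_gb:.1f} GB")
```

Under those assumptions the dense Hessian needs about 400 TB, while the limited-memory history needs about 0.8 GB. That gap is why limited-memory and Krylov variants are the ones usually discussed for networks of this size.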

Also, in reply to a sibling comment: you can add momentum and step-length adjustments to second-order methods in much the same way as in steepest descent to help escape saddles. The only difference is how the descent direction is chosen for the optimization.
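
A minimal sketch of that last point, assuming a toy quadratic f(x, y) = 0.5 * (x**2 + 10 * y**2) and a plain heavy-ball momentum rule. The function names, step size, and momentum value are hypothetical choices for illustration. The update loop is shared; only the function that produces the direction changes.

```python
def grad(p):
    # Gradient of f(x, y) = 0.5 * (x**2 + 10 * y**2).
    return (p[0], 10.0 * p[1])

def steepest_direction(p):
    g = grad(p)
    return (-g[0], -g[1])

def newton_direction(p):
    # The Hessian is diag(1, 10), so applying its inverse
    # is an elementwise division of the gradient.
    g = grad(p)
    return (-g[0] / 1.0, -g[1] / 10.0)

def minimize(direction, start, step=0.1, momentum=0.5, iters=200):
    p, v = start, (0.0, 0.0)
    for _ in range(iters):
        d = direction(p)
        # Heavy-ball update: identical for both methods, only `d` differs.
        v = (momentum * v[0] + step * d[0], momentum * v[1] + step * d[1])
        p = (p[0] + v[0], p[1] + v[1])
    return p

start = (3.0, -2.0)
print(minimize(steepest_direction, start))
print(minimize(newton_direction, start))
```

Both runs should end up very close to the minimum at the origin. In a real second-order method the direction would come from solving a linear system or from a quasi-Newton approximation rather than a hand-coded diagonal inverse.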

steev 3093 days ago

This is correct - second-order methods are great in theory, but they are generally computationally prohibitive for high-dimensional problems.

jph00 3093 days ago

Second-order methods are attracted to saddle points in high-dimensional spaces. The math and practice of optimizing these surfaces have a lot of nuances like this, so much of what you learn in a convex optimization class doesn't apply too well.
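
A small sketch of what "attracted to saddle points" can look like, assuming the toy function f(x, y) = x**2 - y**2 (a saddle at the origin) and an unmodified Newton step. The starting point and learning rate are arbitrary illustrative choices. Because the Newton step solves for a zero of the gradient, it does not distinguish a saddle from a minimum.

```python
def grad(x, y):
    # Gradient of f(x, y) = x**2 - y**2, which has a saddle at the origin.
    return 2.0 * x, -2.0 * y

def newton_step(x, y):
    # The Hessian is diag(2, -2), so its inverse is diag(1/2, -1/2).
    gx, gy = grad(x, y)
    return x - 0.5 * gx, y - (-0.5) * gy

def gd_step(x, y, lr=0.1):
    gx, gy = grad(x, y)
    return x - lr * gx, y - lr * gy

start = (1.0, 0.5)

# One pure Newton step lands exactly on the saddle.
newton_point = newton_step(*start)

# Gradient descent shrinks x but pushes y away from the saddle.
gd_point = start
for _ in range(20):
    gd_point = gd_step(*gd_point)

print("Newton:", newton_point)
print("Gradient descent:", gd_point)
```

On this function the Newton iterate sits at the origin after one step, while the gradient descent iterate moves away along the negative-curvature direction y. Practical second-order methods usually modify the step (damping, trust regions, or handling negative eigenvalues) to avoid this behavior.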

steev 3093 days ago

Do you have any recommendations on sources to read about this? Everything I've read discusses the use of the Hessian to not only determine you are at a saddle point but to also use its eigenvalues to escape.
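
For reference, the idea described here can be sketched as follows, assuming the toy function f(x, y) = x * y, whose Hessian at the origin is [[0, 1], [1, 0]]. The helper `sym2x2_eig` and the step length are hypothetical and only handle the 2x2 symmetric case. Eigenvalues of mixed sign flag a saddle, and the eigenvector of the negative eigenvalue gives a direction along which the function decreases.

```python
import math

def sym2x2_eig(a, b, c):
    """Eigenpairs of the symmetric matrix [[a, b], [b, c]], smallest eigenvalue first."""
    mean = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    pairs = []
    for lam in (mean - radius, mean + radius):
        if b != 0.0:
            # (b, lam - a) satisfies the first row of (H - lam * I) v = 0.
            vx, vy = b, lam - a
        else:
            # Diagonal matrix: eigenvectors are the coordinate axes.
            vx, vy = (1.0, 0.0) if abs(lam - a) <= abs(lam - c) else (0.0, 1.0)
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs

def f(x, y):
    return x * y

# Hessian of f at the origin, given as (a, b, c) for [[a, b], [b, c]].
(lam_min, v_min), (lam_max, _) = sym2x2_eig(0.0, 1.0, 0.0)

# Mixed-sign eigenvalues at a stationary point indicate a saddle.
is_saddle = lam_min < 0.0 < lam_max

# Step along the negative-curvature eigenvector to leave the saddle.
step = 0.1
escaped = (step * v_min[0], step * v_min[1])
print(is_saddle, f(*escaped) < f(0.0, 0.0))
```

For a real network the full eigendecomposition is too expensive, so this only illustrates the principle on a two-parameter problem.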

plafl 3093 days ago

I have not used it but there is an implementation in PyTorch: http://pytorch.org/docs/master/optim.html#torch.optim.LBFGS

bra-ket 3090 days ago

the question is - why do you need to optimize in the first place? why don't you look up an answer instead of solving a mathematical optimization problem?
