| HN Mirror



	by PartiallyTyped 1526 days ago
	Well, quadratic convergence usually requires the Hessian, or an approximation of it, and that's difficult to get in deep learning due to memory constraints and the difficulty of computing second-order derivatives. Computing the derivatives is not very difficult with e.g. JAX, but ... you get back to the memory issue. The Hessian is a square n × n matrix, so in deep learning, if we have a million parameters, the Hessian has 1 trillion entries...
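To put rough numbers on the memory issue, here is a back-of-the-envelope sketch. It assumes a dense Hessian stored in float32 (4 bytes per entry); that storage format is an assumption for illustration, not something stated in the comment.

```python
# Back-of-the-envelope memory cost of a dense Hessian.
# Assumption: every entry is stored as float32 (4 bytes).
n_params = 1_000_000

hessian_entries = n_params ** 2          # n x n square matrix
bytes_per_float32 = 4

hessian_bytes = hessian_entries * bytes_per_float32
hessian_terabytes = hessian_bytes / 1e12

# For comparison: the gradient has only n entries.
gradient_megabytes = n_params * bytes_per_float32 / 1e6

print(f"Hessian entries: {hessian_entries:,}")
print(f"Hessian size:    {hessian_terabytes} TB")
print(f"Gradient size:   {gradient_megabytes} MB")
```

Under these assumptions the gradient fits in a few megabytes while the full Hessian needs terabytes, which is the gap the comment is pointing at.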

2 comments

tome 1526 days ago

Not only does it have 1 trillion elements, you also have to invert it!
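A rough sketch of why the inversion hurts, assuming a dense direct method whose cost grows like n^3 (the constant factor is dropped, and the helper name `dense_inverse_flops` is made up for this illustration):

```python
def dense_inverse_flops(n: int) -> int:
    """Order-of-magnitude operation count for inverting a dense n x n matrix.

    Assumption: a direct method with cubic scaling; constant factors are ignored.
    """
    return n ** 3


n_params = 1_000_000
inverse_cost = dense_inverse_flops(n_params)

# A plain gradient update touches each parameter once, so it scales like n.
gradient_update_cost = n_params

print(f"Inversion:       ~{inverse_cost:.0e} operations")
print(f"Gradient update: ~{gradient_update_cost:.0e} operations")
print(f"Ratio:           ~{inverse_cost // gradient_update_cost:.0e}")
```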


PartiallyTyped 1526 days ago

Indeed! BFGS (and its variants) approximate the inverse Hessian, but they have other issues that make them prohibitively expensive.


SleekEagle 1526 days ago

https://c.tenor.com/enoxmmTG1wEAAAAC/heart-attack-in-pain.gi...


ssivark 1526 days ago

To add, one could think of schemes like "momentum" and cousins as attempts to estimate something in the spirit of the inverse Hessian using various hacks/heuristics.
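A toy illustration of that intuition, not a proof: on an ill-conditioned 2D quadratic, plain gradient descent crawls along the flat direction, heavy-ball momentum gets much closer in the same number of steps, and a Newton step (which uses the exact inverse Hessian) lands on the minimum at once. The objective, the curvatures `A` and `B`, the step count, and the textbook heavy-ball tuning for quadratics are all assumptions picked for this sketch.

```python
import math

# Toy objective: f(x) = 0.5 * (A * x0^2 + B * x1^2), Hessian = diag(A, B).
# A and B are illustrative curvatures; B / A = 100 makes it ill-conditioned.
A, B = 1.0, 100.0


def grad(x):
    return (A * x[0], B * x[1])


def norm(x):
    return math.hypot(x[0], x[1])


def gradient_descent(x, lr, steps):
    for _ in range(steps):
        g = grad(x)
        x = (x[0] - lr * g[0], x[1] - lr * g[1])
    return x


def heavy_ball(x, lr, beta, steps):
    # Classical momentum: the velocity accumulates past gradients.
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(x)
        v = (beta * v[0] - lr * g[0], beta * v[1] - lr * g[1])
        x = (x[0] + v[0], x[1] + v[1])
    return x


def newton_step(x):
    # Multiply the gradient by the exact inverse Hessian diag(1/A, 1/B).
    g = grad(x)
    return (x[0] - g[0] / A, x[1] - g[1] / B)


start = (1.0, 1.0)
steps = 100

# Gradient descent with step size 1 / (largest curvature).
gd_err = norm(gradient_descent(start, lr=1.0 / B, steps=steps))

# Heavy-ball parameters from the standard tuning for quadratics.
sqrt_a, sqrt_b = math.sqrt(A), math.sqrt(B)
hb_lr = 4.0 / (sqrt_a + sqrt_b) ** 2
hb_beta = ((sqrt_b - sqrt_a) / (sqrt_b + sqrt_a)) ** 2
hb_err = norm(heavy_ball(start, hb_lr, hb_beta, steps=steps))

newton_err = norm(newton_step(start))

print(f"distance to minimum after {steps} GD steps:         {gd_err:.3e}")
print(f"distance to minimum after {steps} heavy-ball steps: {hb_err:.3e}")
print(f"distance to minimum after 1 Newton step:          {newton_err:.3e}")
```

Momentum never sees the Hessian here, yet it recovers part of the benefit of curvature information, which is roughly the sense in which it can be read as a cheap stand-in.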
