by smallcharleston
2355 days ago
The reason the gradient is preferred (as I understand it) comes down to computational considerations. Assuming you can compute both of them, a Newton method (which uses the gradient and the derivative of the gradient, the Hessian) usually has faster convergence (quadratic instead of linear). However, that Hessian can be big and difficult to compute, and then you need a linear solve with it. On top of that, in most ML applications you don’t even have the full gradient. You have a stochastic estimate, since your loss is generally additive in the data, so you’re not (as far as I know) even going to bother trying to form a Hessian. I believe many have investigated quasi-Newton methods based on estimated gradients, but I haven’t looked into that thoroughly.
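To make the convergence point concrete, here is a rough sketch comparing plain gradient descent with Newton's method. The test function, the starting point, the step size of 0.1, and the tolerance are all my own assumptions, picked only so that both methods converge; nothing here comes from a real ML problem. It is plain Python so it runs without extra packages.

```python
import math

# Hypothetical test problem (an assumption for illustration only):
# f(x, y) = 1.5*x^2 + x*y + y^2 - x - y + (x^4 + y^4) / 4, which is strictly convex.

def grad(x, y):
    """Gradient of f."""
    return (3 * x + y - 1 + x**3, x + 2 * y - 1 + y**3)

def hess(x, y):
    """Hessian of f, i.e. the derivative of the gradient."""
    return ((3 + 3 * x**2, 1.0), (1.0, 2 + 3 * y**2))

def gradient_descent(x, y, step=0.1, tol=1e-10, max_iter=10_000):
    """Fixed-step gradient descent; returns the point and iterations used."""
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            return (x, y), k
        x, y = x - step * gx, y - step * gy
    return (x, y), max_iter

def newton(x, y, tol=1e-10, max_iter=50):
    """Pure Newton iteration; each step solves the 2x2 system H d = g."""
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            return (x, y), k
        (h11, h12), (h21, h22) = hess(x, y)
        det = h11 * h22 - h12 * h21
        # Cramer's rule stands in for the "linear solve with the Hessian".
        dx = (gx * h22 - h12 * gy) / det
        dy = (h11 * gy - h21 * gx) / det
        x, y = x - dx, y - dy
    return (x, y), max_iter

x_gd, iters_gd = gradient_descent(0.0, 0.0)
x_nt, iters_nt = newton(0.0, 0.0)
print("gradient descent:", x_gd, "after", iters_gd, "iterations")
print("newton:          ", x_nt, "after", iters_nt, "iterations")
```

In two dimensions the Hessian solve is trivial, which hides the cost side of the trade-off. With millions of parameters, forming and solving with that matrix is the part that hurts.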
|
My main objective was to highlight that, given that we are performing classic gradient descent, stepping along the negative gradient yields the greatest local reduction in the function value. Essentially it was a point to highlight the underlying calculus. Wayne Winston, in his book Operations Research: Applications and Algorithms, has an interesting passage where he discusses the gradient being the direction of maximum increase (he was looking at steepest ascent).
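A quick numerical check of that calculus point, as a sketch: for a unit direction d, the first-order rate of change of f is the dot product of the gradient with d, and by Cauchy-Schwarz it is most negative when d points along the negative gradient. The function f(x, y) = x^2 + 3y^2, the point (1, 1), and the 360 sampled directions are my own choices for illustration, not anything from Winston's book.

```python
import math

def grad_f(x, y):
    """Gradient of the illustrative function f(x, y) = x^2 + 3*y^2."""
    return (2.0 * x, 6.0 * y)

# Gradient at an arbitrarily chosen point.
gx, gy = grad_f(1.0, 1.0)
g_norm = math.hypot(gx, gy)

# Unit vector along the negative gradient (the steepest descent direction).
steepest = (-gx / g_norm, -gy / g_norm)

# Sample unit directions around the circle, one per degree.
n = 360
candidates = [
    (math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
    for k in range(n)
]

# Directional derivative of f along each candidate direction.
slopes = [gx * dx + gy * dy for dx, dy in candidates]
best_slope = min(slopes)
best_dir = candidates[slopes.index(best_slope)]

print("best sampled direction:     ", best_dir)
print("negative gradient direction:", steepest)
print("best sampled slope:", best_slope, "lower bound:", -g_norm)
```

Flipping the sign gives the steepest ascent version: the largest slope among the sampled directions is the one closest to the gradient itself.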