| HN Mirror



	by smallcharleston 2402 days ago
	The reason the gradient is preferred (as I understand it) comes down to computational considerations. Assuming you can actually compute both of them, a Newton method (which uses the gradient and the derivative of the gradient, the Hessian) usually has faster convergence (quadratic instead of linear). However, that Hessian can be big and difficult to compute, and then you need a linear solve with it. On top of that, in most ML applications you don’t even have the full gradient. You have a stochastic estimate since your loss is generally additive in the data. So you’re not (as far as I know) even going to bother trying to form a Hessian. I believe many have investigated quasi-Newton methods based on estimate gradients but I haven’t investigated that thoroughly.
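To make the convergence-rate point concrete, here is a minimal sketch comparing fixed-step gradient descent with a pure Newton iteration. The test function f(x, y) = x^4/4 + x^2/2 + x*y + y^2, the starting point, the step size, and the tolerance are all assumptions chosen for illustration; they do not come from the thread.

```python
import math

# Hypothetical test problem: f(x, y) = x**4/4 + x**2/2 + x*y + y**2,
# a strictly convex function whose unique minimum is at (0, 0).
def grad(x, y):
    return (x**3 + x + y, x + 2.0 * y)

def hessian(x, y):
    return ((3.0 * x**2 + 1.0, 1.0), (1.0, 2.0))

def gradient_descent(x, y, step=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step gradient descent; returns the final point and iteration count."""
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            return (x, y), k
        x, y = x - step * gx, y - step * gy
    return (x, y), max_iter

def newton(x, y, tol=1e-8, max_iter=100):
    """Pure Newton iteration (no line search); each step solves H d = g."""
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) < tol:
            return (x, y), k
        (a, b), (c, d) = hessian(x, y)
        det = a * d - b * c
        # 2x2 linear solve by Cramer's rule; in n dimensions this is the costly step.
        dx = (d * gx - b * gy) / det
        dy = (a * gy - c * gx) / det
        x, y = x - dx, y - dy
    return (x, y), max_iter

gd_point, gd_iters = gradient_descent(2.0, 1.0)
nt_point, nt_iters = newton(2.0, 1.0)
print("gradient descent iterations:", gd_iters)
print("Newton iterations:", nt_iters)
```

On this toy problem Newton needs far fewer iterations, but each one requires forming the Hessian and solving a linear system with it, which is exactly the cost that becomes prohibitive when the number of parameters is large.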


JunaidB 2402 days ago

This is a really important point and I wish I'd mentioned it. The computational considerations (as you've said) make the classic gradient descent method infeasible in practice. Therefore we resort to stochastic estimates or quasi-Newton approaches (which I'm still looking into).

My main objective was to highlight that, given that we are performing classic gradient descent, stepping along the negative gradient yields the greatest local reduction in the function value. Essentially, the point was to highlight the underlying calculus. Wayne Winston, in his book Operations Research: Applications and Algorithms, has an interesting passage where he discusses the gradient being the direction of maximum increase (he was looking at steepest ascent).
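The underlying calculus is short: the directional derivative of f along a unit vector u is grad f · u, and by Cauchy-Schwarz this is at least -||grad f||, with equality exactly when u = -grad f / ||grad f||. A small numerical check of that statement, assuming a made-up function f(x, y) = x^2 + 3y^2 and the point (1, 1):

```python
import math

# Hypothetical example: f(x, y) = x**2 + 3*y**2, evaluated at the point (1, 1).
def grad_f(x, y):
    return (2.0 * x, 6.0 * y)

gx, gy = grad_f(1.0, 1.0)
norm = math.hypot(gx, gy)

# Unit vector along the negative gradient.
steepest = (-gx / norm, -gy / norm)

# Scan unit directions u = (cos t, sin t) in 0.1 degree steps and keep the one
# with the most negative directional derivative grad_f . u.
best_rate, best_dir = min(
    (gx * math.cos(t) + gy * math.sin(t), (math.cos(t), math.sin(t)))
    for t in (2.0 * math.pi * i / 3600 for i in range(3600))
)

print("best sampled direction:", best_dir)
print("negative gradient direction:", steepest)
print("best sampled rate:", best_rate, "lower bound -||grad f||:", -norm)
```

Up to the resolution of the scan, the best sampled direction coincides with the normalized negative gradient, and no sampled direction beats the bound -||grad f||.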


chestervonwinch 2401 days ago

> You have a stochastic estimate since your loss is generally additive in the data.

Just to be clear, additive loss doesn't imply stochastic gradient estimate. Rather, because the loss function is additive, then stochastic gradient estimates of the loss are now possible. But, this of course does not mean one has to use stochastic gradient estimates.

It's just that it's easier to update and monitor progress this way, rather than computing the gradient term for every single example in the training set and then taking a descent step. The surprising thing is that stochastic gradient descent converges quickly in practice relative to proper gradient descent. All of the justification and whatnot for SGD for ML is largely post-hoc because it works so unreasonably well and is so intuitive to anyone who has taken calculus.
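A minimal sketch of both points, assuming a made-up one-parameter least-squares problem (the data, step sizes, and iteration counts are illustrative only). Because the loss is an average of per-example terms, the full gradient is the average of minibatch gradients, so any single example or minibatch gives an estimate of it; whether to step on the estimate or on the full gradient is a separate choice.

```python
import random

random.seed(0)

# Hypothetical data: y = 3*x plus a little noise, n = 200 examples.
n = 200
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def example_grad(w, x, y):
    # d/dw of the per-example loss (w*x - y)**2
    return 2.0 * (w * x - y) * x

def batch_grad(w, idx):
    # Gradient of the mean loss over the examples indexed by idx.
    return sum(example_grad(w, xs[i], ys[i]) for i in idx) / len(idx)

# Additivity: averaging the gradients of 10 equal-sized minibatches
# recovers the full-batch gradient.
full = batch_grad(0.0, range(n))
batches = [range(s, s + 20) for s in range(0, n, 20)]
mean_of_minibatches = sum(batch_grad(0.0, b) for b in batches) / len(batches)

# Full-batch gradient descent: every step touches all n examples.
w_gd = 0.0
for _ in range(200):
    w_gd -= 0.5 * batch_grad(w_gd, range(n))

# SGD: every step touches one example, in a reshuffled order each epoch.
w_sgd = 0.0
for epoch in range(20):
    order = list(range(n))
    random.shuffle(order)
    for i in order:
        w_sgd -= 0.05 * example_grad(w_sgd, xs[i], ys[i])

print("full gradient:", full, "mean of minibatch gradients:", mean_of_minibatches)
print("w from full-batch GD:", w_gd, "w from SGD:", w_sgd)
```

With a constant step size the SGD iterate keeps jittering around the least-squares solution rather than settling on it exactly, which ties in with the point that exact convergence on the training loss is not really the goal.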

The other aspect (with respect to the context of optimization in machine learning) is that this optimization is performed over a loss over a training dataset, for which you really don't even want convergence to an exact minimum of the training loss. What you really care about is the expected generalization loss. Convergence to the exact minimum of the training loss doesn't necessarily guarantee the best generalization loss. I mention this because it contributes to the general aloofness towards optimization convergence rates in ML.

> I believe many have investigated quasi-Newton methods based on estimate gradients but I haven’t investigated that thoroughly.

Until semi-recently, quasi-Newton was not explored in the stochastic setting because of the question of how to extend the Wolfe conditions to this arena. There's been a bit of work on this [1], but I don't think it's caught on outside of the optimization community (not that it necessarily should, considering the points above).

[1]: https://arxiv.org/abs/1401.7020
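For context, the standard (weak) Wolfe conditions on a step length alpha along a descent direction p are a sufficient-decrease test, f(x + alpha*p) <= f(x) + c1*alpha*grad f(x)·p, and a curvature test, grad f(x + alpha*p)·p >= c2*grad f(x)·p, with 0 < c1 < c2 < 1. The sketch below checks them on a made-up quadratic; the function `satisfies_wolfe`, the constants c1 = 1e-4 and c2 = 0.9, and the test problem are illustrative assumptions, not the method from [1].

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_wolfe(f, grad, x, p, alpha, c1=1e-4, c2=0.9):
    """Check the weak Wolfe conditions for step length alpha along direction p."""
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
    slope0 = dot(grad(x), p)  # directional derivative at x; negative for a descent direction
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = dot(grad(x_new), p) >= c2 * slope0
    return sufficient_decrease and curvature

# Hypothetical test problem: f(x) = x0**2 + x1**2, searching along the negative gradient.
f = lambda x: x[0] ** 2 + x[1] ** 2
grad = lambda x: [2.0 * x[0], 2.0 * x[1]]
x = [1.0, 0.0]
p = [-g for g in grad(x)]

results = {alpha: satisfies_wolfe(f, grad, x, p, alpha) for alpha in (0.01, 0.5, 1.5)}
print(results)  # {0.01: False, 0.5: True, 1.5: False}
```

The step 0.01 fails the curvature test (too short) and 1.5 fails sufficient decrease (it overshoots). Both tests compare exact values of f and its gradient at two points, which is why it is unclear what they should mean when only noisy minibatch estimates are available.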


JunaidB 2401 days ago

Your point about convergence to the exact minimum of the training loss not guaranteeing the best generalization loss reminds me of the point made in this lecture: https://www.youtube.com/watch?v=k3AiUhwHQ28

You also made an interesting comment about work not catching on outside of the optimization community - can you recommend some resources or websites to follow in order to see what the optimization community is working on? I've developed an interest in the area but don't really know where to go for "up to date" information.


smallcharleston 2401 days ago

I’m not that other guy, and I also haven’t read this paper, but it seems quite thorough:

https://arxiv.org/abs/1606.04838


JunaidB 2401 days ago

This seems like an excellent review. I'll check it out. Thanks very much!
