|
|
|
|
|
by thorel
2581 days ago
|
|
There are a few inaccuracies here: using automatic differentiation basically makes computing the gradient of the objective function as efficient as computing the function itself. But the main goal of algorithms being used in machine learning nowadays (stochastic gradient descent and variants thereof) is to avoid having to compute the objective function or its gradient altogether: instead, the gradient is computed at a single data point (example) which provides an approximation of the true gradient. The important thing is that what is considered expensive is not to compute the objective function or its gradient, but to compute it over the entire dataset. Line-search would require evaluating the function (and its gradient) over the entire dataset several times, which completely defeats the purpose of stochastic gradient descent. One could imagine doing a line search using the approximate function or its gradient (coming from the evaluation at a single data point) as the basis for a line-search, but intuitively it does not make much sense to fine-tune the step size to a single example, and this would in any case destroy the guarantees provided by the Strong Wolfe conditions. Finally, there are convergence guarantees for stochastic gradient descent with fixed learning rate when the objective functions are convex. |
|
1. Reverse mode in automatic differentiation is not as efficient at the function evaluation. Even discounting certain costs, and depending on how you count, the theoretic cost is 4-5 times a function evaluation. Practically speaking, operator overloading approaches run somewhere between 20-40 times function evaluation whereas source code transformation tools run at 10-20 times. This is fantastic, but the function evaluation is cheaper.
2. I also don't believe that stochastic gradient descent requires the entire function and gradient to be revaluated in the manner that you describe. One way to view stochastic gradient descent in the context of least squares fitting is through the use of Johnson-Lindenstrauss, which means that the data set can be randomly projected once per iteration. This means that the gradient and line-search parameters can be consistently evaluated at the per iterations level. Practically speaking, this means we randomly add our data together and then proceed as normal changing the randomization each iteration. As such, there should not be an increase in cost by doing a line-search over the already discounted cost.
3. As far as if the Wolfe conditions are destroyed, kind of sort of. In order to guarantee convergence, the amount of reduction that we use must also be reduced. Meaning, we can't project down the data quite as much if we really want to achieve convergence. However, practically speaking, I believe it to matter, a lot.