
I don't believe this to be entirely accurate:

1. Reverse mode in automatic differentiation is not as efficient as the function evaluation. Even discounting certain costs, and depending on how you count, the theoretical cost is 4-5 times a function evaluation. Practically speaking, operator overloading approaches run somewhere between 20-40 times a function evaluation, whereas source code transformation tools run at 10-20 times. This is fantastic, but the function evaluation is cheaper.

2. I also don't believe that stochastic gradient descent requires the entire function and gradient to be re-evaluated in the manner that you describe. One way to view stochastic gradient descent in the context of least squares fitting is through the Johnson-Lindenstrauss lemma, which means that the data set can be randomly projected once per iteration. This means that the gradient and line-search parameters can be evaluated consistently at the per-iteration level. Practically speaking, this means we randomly add our data together and then proceed as normal, changing the randomization each iteration. As such, there should be no increase in cost from doing a line search over the already discounted cost.
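To make point 2 concrete, here is a minimal sketch of what I mean, not a definitive implementation. It assumes NumPy, a random sign matrix as the Johnson-Lindenstrauss style projection, and a plain backtracking (Armijo) line search. The function name `sketched_gradient_descent` and the parameters `k`, `iters`, and `seed` are made up for illustration.

```python
import numpy as np

def sketched_gradient_descent(A, b, k, iters=200, seed=0):
    """Minimize 0.5*||A x - b||^2 using a fresh random projection per iteration."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(iters):
        # New randomization each iteration: every row of S @ A is a random
        # signed sum of the data rows (assumed sign sketch, scaled by 1/sqrt(k)).
        S = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)
        SA, Sb = S @ A, S @ b

        def f(z):
            # Reduced objective; only touches the k projected rows.
            r = SA @ z - Sb
            return 0.5 * (r @ r)

        r = SA @ x - Sb
        g = SA.T @ r          # gradient of the same reduced objective
        fx = 0.5 * (r @ r)
        gg = g @ g
        if gg == 0.0:
            break

        # Backtracking line search. Function and gradient both come from the
        # same projected data, so they are consistent within this iteration.
        t = 1.0
        while f(x - t * g) > fx - 1e-4 * t * gg and t > 1e-12:
            t *= 0.5
        x = x - t * g
    return x
```

Each trial step in the line search costs a product with the k-by-n projected matrix rather than the full m-by-n data, which is the discounted cost referred to above.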

3. As for whether the Wolfe conditions are destroyed: kind of, sort of. In order to guarantee convergence, the amount of reduction that we use must itself be reduced. Meaning, we can't project the data down quite as much if we really want to achieve convergence. However, practically speaking, I believe it matters a lot.