Hacker News new | ask | show | jobs
by murbard2 1573 days ago
I wonder if there's any value in picking not one but a couple random directions, and then finding the update in their span. Increases memory consumption but reduces the noise. It could be that the trade-off is best when picking a single direction, but who knows?

If you pick 2 directions at each pass, one of them could be the direction of the last update and the other a random one, allowing for some kind of momentum.

2 comments

Yes there is value, but there is also some cost, and this value can be grabbed naturally by using an optimizer that can make use of multiple direction.

Optimizer like O-LBFGS maintain internally a low-dimension quadratic approximation of the cost function, that is updated with each noisy gradient evaluation, and then the optimizer consist of taking a step in the direction that would minimize that approximation.

In the same line of thought, one can ask whether a line-search like in usual optimization algorithms is worth it or not.

The memory cost is usually storing the parameters multiple times and an increase computing cost per iteration. Whether the tradeoff is worth it or not is usually problem dependent. Using a quadratic approximation is more costly than using a linear interpolation but it often pays off when encountering saddle points.

There are also technique like Polyak averaging (parameter tracking, Teacher-Student...) that have some success in Reinforcement Learning, where they help to regularize the energy landscape.

OpenAI "Evolution Strategy" used a low number of random directions, and they found that they needed roughly ~100 (= implicit dimension of the problem) directions to have equivalent performance than backprop, but they were computing the derivative with finite difference instead of forward differentiation.

Choosing the direction of travel to measure he gradient in or sampling around it makes sense to me too.