by murbard2
1577 days ago
I only skimmed the paper, so I'm not sure I got this right, but my understanding is the following:

1. Pick a random direction.
2. Compute the derivative along that direction using forward-mode differentiation.
3. Update the parameters along that direction, scaled by that derivative.

The idea is that this gives an unbiased (albeit noisy) estimate of the actual gradient. You thus need a smaller learning rate, but you also need less memory and computation, and on balance they argue it's a win. Is this correct?
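To make the three steps concrete, here is a minimal pure-Python sketch, not the paper's implementation. The `Dual` class, the toy `loss`, the standard-normal direction, and the learning rate are all my own assumptions for illustration. Forward mode is done with dual numbers: the tangent part of the output is the directional derivative. With v drawn from a standard normal, the expectation of (grad f · v) v is grad f, which is where the "unbiased" part comes from.

```python
import random


class Dual:
    """Number val + tan*eps with eps**2 == 0; tan carries the derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(o):
        return o if isinstance(o, Dual) else Dual(o)

    def __add__(self, o):
        o = Dual._lift(o)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, o):
        o = Dual._lift(o)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __mul__(self, o):
        o = Dual._lift(o)
        # Product rule on the tangent part.
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)

    __rmul__ = __mul__


def loss(theta):
    # Hypothetical toy objective, minimized at theta = (1, ..., 1).
    return sum((t - 1.0) * (t - 1.0) for t in theta)


def directional_derivative(f, theta, v):
    # Step 2: one forward pass with tangents seeded by v gives grad f . v.
    return f([Dual(t, vi) for t, vi in zip(theta, v)]).tan


def forward_gradient_step(f, theta, lr, rng):
    # Step 1: random direction (assumed standard normal per coordinate).
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    d = directional_derivative(f, theta, v)
    # Step 3: move along v, scaled by the directional derivative.
    return [t - lr * d * vi for t, vi in zip(theta, v)]


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
for _ in range(500):
    theta = forward_gradient_step(loss, theta, lr=0.05, rng=rng)
print(loss(theta))  # final loss after 500 noisy steps
```

Note that no backward pass and no stored activations are needed, only one extra tangent value per intermediate number.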
If you pick two directions at each pass, one of them could be the direction of the last update and the other a random one, which would allow for some kind of momentum.
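One possible reading of the two-direction idea, as a rough sketch only: take one step along a fresh random direction and add a second step along the normalized previous update, each scaled by its own directional derivative. Everything here is my assumption, including the names `two_direction_step` and `jvp`, the way the two terms are combined, and the toy objective. The `jvp` function is a hand-derived directional derivative standing in for a forward-mode pass, and the combined estimate is no longer unbiased, since the second direction is not random.

```python
import math
import random


def loss(theta):
    # Hypothetical toy objective, minimized at theta = (1, ..., 1).
    return sum((t - 1.0) ** 2 for t in theta)


def jvp(theta, v):
    # Hand-derived directional derivative of `loss` along v; a stand-in
    # for what one forward-mode pass would return.
    return sum(2.0 * (t - 1.0) * vi for t, vi in zip(theta, v))


def two_direction_step(theta, prev_update, lr, rng):
    # Direction 1: fresh random direction, as in the basic scheme.
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    d_v = jvp(theta, v)
    step = [-lr * d_v * vi for vi in v]

    # Direction 2: unit vector along the previous update (skipped at start).
    norm = math.sqrt(sum(p * p for p in prev_update))
    if norm > 0.0:
        u = [p / norm for p in prev_update]
        d_u = jvp(theta, u)
        step = [s - lr * d_u * ui for s, ui in zip(step, u)]

    new_theta = [t + s for t, s in zip(theta, step)]
    return new_theta, step


rng = random.Random(0)
theta, prev = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
for _ in range(500):
    theta, prev = two_direction_step(theta, prev, lr=0.05, rng=rng)
print(loss(theta))  # final loss after 500 steps
```

This costs two forward passes per update instead of one. Whether reusing the last direction actually behaves like momentum on a real network is not something this toy example can show.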