by wnoise
3356 days ago
For the bits described in https://arxiv.org/abs/1610.02527 you're essentially correct, though it's still stochastic, and you can have mini-batching on each node. The interesting technical bits are in a second paper (https://arxiv.org/abs/1610.05492). To save on update bandwidth, they either restrict the gradient to a lower-dimensional space or compress the full gradient by quantizing it (which should effectively add zero-mean noise) before sending it back. (In theory they could do both, but they didn't actually test that.)
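To make the "zero-mean noise" point concrete, here is a rough sketch of one way to quantize stochastically: round each coordinate to the vector's min or max, with probabilities chosen so the expected value equals the original coordinate. The function name `quantize_1bit` and the toy gradient are made up for illustration, and this is not claimed to be the exact scheme from the paper.

```python
import random

def quantize_1bit(grad, rng):
    """Stochastically round each coordinate to the vector's min or max.

    Hypothetical helper for illustration. P(round up) is chosen so that
    the expected quantized value equals the original coordinate.
    """
    lo, hi = min(grad), max(grad)
    if hi == lo:
        return list(grad)  # constant vector: nothing to quantize
    out = []
    for g in grad:
        p_hi = (g - lo) / (hi - lo)  # in [0, 1]
        out.append(hi if rng.random() < p_hi else lo)
    return out

rng = random.Random(0)
grad = [-0.8, -0.1, 0.0, 0.3, 1.2]  # toy "gradient", assumed values

# Average many independent quantizations of the same vector.
trials = 20000
sums = [0.0] * len(grad)
for _ in range(trials):
    q = quantize_1bit(grad, rng)
    sums = [s + x for s, x in zip(sums, q)]
mean = [s / trials for s in sums]

# Largest gap between the averaged quantized vector and the original.
max_err = max(abs(m - g) for m, g in zip(mean, grad))
print(max_err)
```

Each individual quantized vector is very coarse (only two distinct values), but the error averages out across many updates, which is why the server-side aggregate can still be usable.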
|
The entire point of using stochasticity (i.e., random shuffling) is to keep a run of similar or identically ordered examples from redirecting the hill climbing in a globally non-optimal direction all at once.
A single user's examples will be very similar, so you can shuffle the examples from one user as much as you want, but that doesn't make it truly stochastic in the context of gradient descent optimization.
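A toy illustration of that point, with made-up data: two hypothetical users whose examples cluster around -1 and +1. A batch mean stands in for a mini-batch gradient here (for the loss 0.5 * (w - x)^2 at w = 0, the gradient is just -x). Sampling randomly within one user still gives a batch centered on that user, not on the pooled data.

```python
import random

rng = random.Random(0)

# Two hypothetical users; each user's examples are tightly clustered.
users = {
    "a": [rng.gauss(-1.0, 0.1) for _ in range(100)],
    "b": [rng.gauss(1.0, 0.1) for _ in range(100)],
}
pooled = users["a"] + users["b"]
global_mean = sum(pooled) / len(pooled)

def batch_mean(examples, size, rng):
    """Mean of a random batch drawn without replacement."""
    return sum(rng.sample(examples, size)) / size

# Random batch from a single user: random order, same bias.
local = batch_mean(users["a"], 10, rng)

# Random batches from the pooled data, averaged over many draws.
draws = 200
avg_mixed = sum(batch_mean(pooled, 10, rng) for _ in range(draws)) / draws

print(global_mean, local, avg_mixed)
```

The single-user batch lands near -1 however it is shuffled, while batches drawn across users average out near the pooled mean.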
The quantization / compression part is pretty cool though. I suppose that could obfuscate slightly what the original example was for privacy purposes? Seems like you'd lose on accuracy though.