by muzakthings 3356 days ago
This is literally non-stochastic gradient descent where the batch update simply comes from a single node and a correlated set of examples. Nothing mind-blowing about it.

For the bits described in https://arxiv.org/abs/1610.02527 you're essentially correct. Though it's still stochastic, and you can have mini-batching on each node.

The interesting technical bits are in https://arxiv.org/abs/1610.05492

To save on update bandwidth, they either restrict the gradient to a lower-dimensional space, or compress the full gradient by quantizing it (which should effectively add zero-mean noise) before sending it back. (In theory they could do both, but they didn't actually test that.)
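To make the "zero-mean noise" point concrete, here is a rough sketch of one unbiased 1-bit quantizer: each coordinate is randomly rounded to the vector's min or max, with probabilities chosen so the expected value equals the original coordinate. The function name `quantize_1bit` and the toy gradient are my own, and I'm not claiming this matches the paper's exact scheme.

```python
import random

def quantize_1bit(grad, rng):
    """Stochastically round each coordinate to the vector's min or max.

    Illustrative assumption, not necessarily the paper's exact method.
    The rounding probability is chosen so that E[quantized] == original.
    """
    lo, hi = min(grad), max(grad)
    if hi == lo:
        return list(grad)  # nothing to quantize
    out = []
    for g in grad:
        p_hi = (g - lo) / (hi - lo)  # probability of rounding up to hi
        out.append(hi if rng.random() < p_hi else lo)
    return out

rng = random.Random(0)
grad = [0.3, -1.2, 0.7, 0.05]  # toy gradient, made up for illustration

# Average many independent quantizations: the noise should cancel out.
trials = 20000
avg = [0.0] * len(grad)
for _ in range(trials):
    q = quantize_1bit(grad, rng)
    for i, v in enumerate(q):
        avg[i] += v / trials

print(avg)  # each entry should land close to the matching entry of grad
```

Each quantized vector only needs one bit per coordinate plus the two floats `lo` and `hi`, and the error averages out across many updates rather than biasing them.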

Just because you shuffle the examples on a single phone/user doesn't make it stochastic.

The entire point of using stochasticity (i.e., random shuffling) is to keep a run of similar or identically ordered examples from redirecting the hill climbing in a globally non-optimal direction all at once.

A single user's examples will be very similar, so you can shuffle the examples from one user all you want; that doesn't make it truly stochastic in the context of gradient descent optimization.
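A toy sketch of what I mean, assuming a 1-D model with squared loss and two made-up users (`user_a`, `user_b`) whose data sit in different ranges. Shuffling one user's examples leaves that user's gradient exactly where it was, and it still points somewhere different from the gradient over everyone's data.

```python
import random

def grad(w, xs):
    # Gradient of mean(0.5 * (w - x)^2) over xs, which is w - mean(xs).
    return w - sum(xs) / len(xs)

# Hypothetical per-user data: each user's examples are similar to each other.
user_a = [1.0, 2.0, 3.0]
user_b = [9.0, 10.0, 11.0]

w = 0.0
rng = random.Random(0)

shuffled_a = user_a[:]
rng.shuffle(shuffled_a)  # reorder user A's examples

g_a = grad(w, user_a)                # gradient from user A alone
g_a_shuffled = grad(w, shuffled_a)   # same examples, different order
g_global = grad(w, user_a + user_b)  # gradient over both users' data

print(g_a, g_a_shuffled, g_global)
```

Sampling mini-batches from user A would add some variance around `g_a`, but the mean of those samples is still `g_a`, not `g_global`.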

The quantization / compression part is pretty cool though. I suppose that could obfuscate slightly what the original example was for privacy purposes? Seems like you'd lose on accuracy though.