by d3m0t3p
530 days ago
Well, it's just like stochastic gradient descent, if you think about it. Plain gradient descent computes the gradient over the whole training set. Stochastic gradient descent computes it on a batch (a subset of the training set), and in the distributed case we process two batches at once by computing the gradient on each in parallel.
The intuition works IMO, but indeed, applying the first batch's update and then the second's is not the same as applying the mean of the two updates. This is indeed super cool!
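To make the difference concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: a single scalar weight, a squared-error loss, made-up batches and learning rate, and hypothetical function names. It only shows that two sequential batch updates land somewhere different from one update with the averaged gradient.

```python
def grad(w, batch):
    """Gradient w.r.t. w of the toy loss 0.5 * mean((w - x)^2) over the batch."""
    return sum(w - x for x in batch) / len(batch)


def sequential_update(w, batch_a, batch_b, lr):
    """Plain SGD: step on batch_a, then step on batch_b from the new weight."""
    w = w - lr * grad(w, batch_a)
    w = w - lr * grad(w, batch_b)  # this gradient is evaluated at the updated w
    return w


def averaged_update(w, batch_a, batch_b, lr):
    """Data-parallel style: both gradients are taken at the same w, then averaged."""
    g = (grad(w, batch_a) + grad(w, batch_b)) / 2
    return w - lr * g


# Hypothetical toy values, chosen only for illustration.
w0 = 0.0
batch_a = [1.0, 2.0]
batch_b = [5.0, 6.0]
lr = 0.1

w_seq = sequential_update(w0, batch_a, batch_b, lr)
w_avg = averaged_update(w0, batch_a, batch_b, lr)
print("sequential:", w_seq)
print("averaged:  ", w_avg)
```

With equal-sized batches, the averaged gradient is the same as the gradient on one batch made of both, so the parallel version behaves like a single step on a batch twice as large rather than like two consecutive steps.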