by pjreddie
3374 days ago
Mini-batches are not an ugly hack. Batch gradient descent is too slow, since you have to go through the whole dataset for every update, and stochastic gradient descent has too much variance (plus you can't do cool things like batch-norm). Mini-batches give you stability and speed: the best of both worlds.
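To make the variance point concrete, here's a rough sketch (not from any particular library) that estimates how noisy the gradient is at batch sizes 1, 32, and the full dataset. The toy 1-D regression problem, the helper names (`make_data`, `grad`, `grad_variance`), and the specific sizes are all assumptions picked for illustration.

```python
import random
import statistics

def make_data(n, seed=0):
    # Hypothetical toy problem: y = 3x + Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys

def grad(w, xs, ys, idx):
    # Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w,
    # averaged over the examples whose indices are in idx.
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def grad_variance(w, xs, ys, batch_size, trials=500, seed=1):
    # Draw many random batches (without replacement inside a batch)
    # and measure how much the gradient estimate varies across them.
    rng = random.Random(seed)
    n = len(xs)
    estimates = [
        grad(w, xs, ys, rng.sample(range(n), batch_size))
        for _ in range(trials)
    ]
    return statistics.pvariance(estimates)

xs, ys = make_data(1000)
var_sgd = grad_variance(0.0, xs, ys, batch_size=1)         # pure SGD
var_mini = grad_variance(0.0, xs, ys, batch_size=32)       # mini-batch
var_full = grad_variance(0.0, xs, ys, batch_size=len(xs))  # full batch

print(f"batch=1: {var_sgd:.4g}  batch=32: {var_mini:.4g}  full: {var_full:.4g}")
```

The full-batch estimate has essentially zero variance but touches every example per step; the single-example estimate costs one example per step but is the noisiest. A batch of 32 sits in between on both counts, which is the trade-off described above.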
I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was pinned down. It feels recent (see the work of Leon Bottou in particular), but I'm certainly no expert in this area.
--- edits: grammar