Hacker News
by pjreddie 3374 days ago
Mini-batches are not an ugly hack. Batch gradient descent is too slow, since you have to go through the whole dataset for every step, and stochastic gradient descent has too much variance (plus you can't do cool things like batch-norm). Mini-batches give you stability and speed, the best of both worlds.
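
To make the variance point concrete, here is a rough sketch that measures how much the gradient estimate jumps around for different batch sizes. The toy data (y = 3*x plus noise), the helper names `grad` and `grad_spread`, and the batch size of 32 are all assumptions picked for illustration, not anything from a real training setup.

```python
import random
import statistics

random.seed(0)

# Toy data (an assumption for illustration): y = 3*x + noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

def grad(w, idx):
    """Gradient of the mean of 0.5*(w*x - y)^2 over the examples in idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def grad_spread(w, batch_size, trials=2000):
    """Standard deviation of the gradient estimate across random batches."""
    n = len(xs)
    estimates = [grad(w, random.sample(range(n), batch_size))
                 for _ in range(trials)]
    return statistics.pstdev(estimates)

w = 0.0
full = grad(w, range(len(xs)))   # exact gradient: needs a pass over all the data
spread_1 = grad_spread(w, 1)     # one example per step: cheap but noisy
spread_32 = grad_spread(w, 32)   # mini-batch: 32x the cost per step, far less noise

print(f"full gradient: {full:.3f}")
print(f"spread, batch size 1:  {spread_1:.3f}")
print(f"spread, batch size 32: {spread_32:.3f}")
```

On this toy problem the spread should shrink roughly with the square root of the batch size, which is the trade-off described above: each step costs more than a single-example step but far less than a full pass.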

drewm1980 is not totally incorrect, though. I don't believe that people began using mini-batch gradient descent because they knew a priori that it had decent properties. I would venture to guess that, initially, people were constrained by memory and computational tractability, and only observed after the fact that SGD actually worked and, in fact, often worked better than non-stochastic descent.

I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was pinned down. It feels recent (see work by Léon Bottou, in particular), but I'm certainly no expert in this area.

--- edits: grammar

So interestingly, SGD has a nice intuitive explanation for why it is better than GD.

If you compute the gradient step over all the data, you're expending computational power on redundant data. You'll get to the minimum having processed less data if you take a step each time you get useful information.
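
A quick way to see the redundancy argument is to give both methods the same budget of per-example gradient evaluations. The sketch below assumes a deliberately extreme, noise-free dataset (10 distinct points, each copied 100 times) and a hand-picked learning rate, so it exaggerates the effect; real data is less redundant than this.

```python
import random

random.seed(0)

# Hypothetical, deliberately redundant dataset: 10 distinct points on y = 2*x,
# each repeated 100 times (1000 examples in total).
distinct = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 11)]
data = distinct * 100

def loss(w):
    """Mean of 0.5*(w*x - y)^2 over the whole dataset."""
    return sum(0.5 * (w * x - y) ** 2 for x, y in data) / len(data)

lr = 0.5  # assumed step size, small enough to be stable for both methods

# Full-batch GD: one pass over the data (1000 gradient evaluations) buys one step.
w_gd = 0.0
g = sum((w_gd * x - y) * x for x, y in data) / len(data)
w_gd -= lr * g

# SGD: the same pass buys one step per example.
w_sgd = 0.0
order = data[:]
random.shuffle(order)
for x, y in order:
    w_sgd -= lr * (w_sgd * x - y) * x

print(f"loss after one pass, full-batch GD: {loss(w_gd):.4f}")
print(f"loss after one pass, SGD:           {loss(w_sgd):.2e}")
```

After a single pass, full-batch GD has moved once and is still far from w = 2, while SGD has effectively converged, because the 100 copies of each point carried no new information for the averaged gradient.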

A brain learns one example at a time, which probably means gradient descent is not a very good learning mechanism.
These are not brains; they are not even models of brains, even if they use the word 'neuron'.
That was kind of my point. A brain learns well using one example at a time. ANNs don't. Hence my conclusion.
You can use gradient descent one example at a time. It still works just fine. The gradients are more unstable, but you will still converge eventually.
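
As a minimal sketch of the one-example-at-a-time case, assuming a toy 1-D regression problem and a decaying step-size schedule (both are assumptions for illustration; the schedule is one common way to tame the noise, not the only one):

```python
import random

random.seed(0)

# Toy problem (assumed): recover w_true from noisy samples of y = w_true * x.
w_true = 3.0

def sample():
    """Draw one fresh (x, y) example."""
    x = random.uniform(-1.0, 1.0)
    return x, w_true * x + random.gauss(0.0, 0.5)

w = 0.0
for t in range(1, 20001):
    x, y = sample()
    lr = 1.0 / (1.0 + 0.01 * t)   # decaying step size damps the gradient noise over time
    w -= lr * (w * x - y) * x     # update from a single example

print(f"estimated w: {w:.3f} (true value {w_true})")
```

Each individual update is noisy, but the estimate settles close to the true value. Note that it takes thousands of examples to get there, so this only illustrates eventual convergence, not sample efficiency.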
"works just fine" is a relative term. I'm pretty sure when I'm learning a (new) alphabet, I don't need to see 1000 examples of each letter 100 times.