HN Mirror



	by pjreddie 3374 days ago
	Mini-batches are not an ugly hack. Batch gradient descent is too slow, since you have to go through the whole dataset for every step, and stochastic gradient descent has too much variance (plus you can't do cool things like batch-norm). Mini-batches give you stability and speed, the best of both worlds.
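To put rough numbers on the variance part of that trade-off, here is a minimal sketch, not anything from the thread itself. It assumes a toy 1-D least-squares problem (true slope 3, made up for illustration), and the helper names `grad` and `gradient_spread` are hypothetical. It measures how far a mini-batch gradient lands from the full-batch gradient, on average, for a few batch sizes.

```python
import random

def grad(w, xs, ys):
    """Gradient of 0.5 * mean((w*x - y)^2) with respect to the scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_spread(xs, ys, w, batch_size, trials=500, seed=0):
    """Mean squared gap between a mini-batch gradient and the full gradient."""
    rng = random.Random(seed)
    full = grad(w, xs, ys)
    total = 0.0
    for _ in range(trials):
        # Draw a mini-batch without replacement.
        idx = rng.sample(range(len(xs)), batch_size)
        g = grad(w, [xs[i] for i in idx], [ys[i] for i in idx])
        total += (g - full) ** 2
    return total / trials

# Toy data (an assumption for illustration): y = 3x plus a little noise.
data_rng = random.Random(42)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(1024)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.1) for x in xs]

# batch_size=1 is plain SGD, batch_size=1024 is full-batch gradient descent.
spread = {b: gradient_spread(xs, ys, 0.0, b) for b in (1, 32, 1024)}
for b, s in spread.items():
    print(f"batch_size={b:5d}  mean squared gradient error={s:.3e}")
```

In this setup the gap shrinks as the batch grows and is essentially zero at the full batch, while each step costs proportionally more per-example gradients. Where the sweet spot sits depends on the problem and the hardware.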

2 comments

chestervonwinch 3374 days ago

drewm1980 is not totally incorrect, though. I don't believe that people began using batch gradient descent because they knew a priori that it had decent properties. I would venture to guess that, initially, people were constrained by memory and computational tractability and only observed after the fact that SGD actually worked and, in fact, often worked better than non-stochastic descent.

I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was pinned down. It feels recent (see work by Léon Bottou, in particular), but I'm certainly no expert in this area.

--- edits: grammar


jg8610 3374 days ago

So interestingly, SGD has a nice intuitive explanation for why it is better than GD.

If you compute the gradient step over all the data, you're spending computational power on redundant data. You're going to get to the minimum with less data if you take steps as you get useful information.
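A minimal sketch of that redundancy argument, under deliberately extreme assumptions: the dataset below is four noise-free points (true slope 3) copied eight times, and the learning rate of 0.1 is an arbitrary choice. Both methods get the same budget of 32 per-example gradient evaluations.

```python
import math

def grad(w, xs, ys):
    """Gradient of 0.5 * mean((w*x - y)^2) with respect to the scalar w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data: 4 distinct noise-free points, duplicated 8 times.
base_x = [0.5, -1.0, 2.0, 1.5]
base_y = [3.0 * x for x in base_x]
copies = 8
xs, ys = base_x * copies, base_y * copies  # 32 points, only 4 distinct

# The full-batch gradient over all 32 points matches the one over the
# 4 distinct points, so the extra 28 evaluations add no information.
same_gradient = math.isclose(grad(0.0, xs, ys), grad(0.0, base_x, base_y))

lr = 0.1

# One full-batch step: 32 per-example gradients, one update.
w_gd = 0.0 - lr * grad(0.0, xs, ys)

# 32 single-example steps: same budget, 32 updates.
w_sgd = 0.0
for x, y in zip(xs, ys):
    w_sgd -= lr * (w_sgd * x - y) * x

print(f"same gradient: {same_gradient}")
print(f"full-batch after 1 step:  w = {w_gd:.4f}")
print(f"SGD after 32 small steps: w = {w_sgd:.4f}")
```

Real datasets are not exact copies, of course, but to the extent that examples overlap in what they tell you about the gradient, the same effect shows up in a weaker form.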


p1esk 3374 days ago

A brain learns one example at a time, which probably means gradient descent is not a very good learning mechanism.


jacquesm 3374 days ago

These are not brains, they are not even models of brains even if they use the word 'neuron'.


p1esk 3374 days ago

That was kind of my point. The brain learns well using one example at a time. ANNs don't. Hence my conclusion.


akyu 3374 days ago

You can use gradient descent one example at a time. It still works just fine. The gradients are more unstable, but you will still converge eventually.
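For what it's worth, here is a minimal sketch of one-example-at-a-time gradient descent on a toy noisy 1-D regression. The data, the decaying step-size schedule, and the function name `sgd_one_at_a_time` are all assumptions for illustration. It compares the result against the closed-form least-squares slope.

```python
import random

def sgd_one_at_a_time(xs, ys, epochs=20, lr0=0.1, seed=0):
    """Fit y ~ w*x by updating w after every single example."""
    rng = random.Random(seed)
    w, step = 0.0, 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            step += 1
            # Decaying step size (an assumed schedule) damps the noise
            # from single-example gradients as training goes on.
            lr = lr0 / (1.0 + 0.01 * step)
            w -= lr * (w * xs[i] - ys[i]) * xs[i]
    return w

# Toy data (an assumption): y = 3x plus noise.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [3.0 * x + data_rng.gauss(0.0, 0.5) for x in xs]

w_sgd = sgd_one_at_a_time(xs, ys)
# Closed-form least-squares slope for a no-intercept 1-D fit.
w_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(f"SGD estimate:  {w_sgd:.3f}")
print(f"least squares: {w_exact:.3f}")
```

The individual updates jump around, but the estimate settles near the least-squares answer. Note that it still takes many passes over the 200 points to get there.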


p1esk 3374 days ago

"works just fine" is a relative term. I'm pretty sure when I'm learning a (new) alphabet, I don't need to see 1000 examples of each letter 100 times.
