| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by chestervonwinch 3379 days ago

drewm1980 is not totally incorrect, though. I don't believe that people began using batch gradient descent because they knew a priori that it had decent properties. I would venture to guess that, initially, people were constrained my memory and computational tractability and only observed after-the-fact that SGD actually worked and in fact, often worked nicer than non-stochastic descent.

I'm not sure really at what point SGD was linked to stochastic approximation, and theoretical explanation of convergence behavior was really pinned down. It feels recent (see work by Leon Bottou, in particular), but I'm certainly no expert in this area.

--- edits: grammar

1 comments

jg8610 3378 days ago

So interestingly, SGD has a nice intuitive explanation for why it is better than GD.

If you compute the gradient step for all data, you're expending computational power on redundant data. You're going to get to the minimum with fewer data if you make steps as you get useful information.

link