by chestervonwinch
3379 days ago
drewm1980 is not totally incorrect, though. I don't believe that people began using batch gradient descent because they knew a priori that it had decent properties. I would venture to guess that, initially, people were constrained by memory and computational tractability, and only observed after the fact that SGD actually worked and, in fact, often worked better than non-stochastic descent. I'm not really sure at what point SGD was linked to stochastic approximation and the theoretical explanation of its convergence behavior was pinned down. It feels recent (see work by Leon Bottou, in particular), but I'm certainly no expert in this area. ---
edits: grammar

If you compute the gradient step over all the data, you're expending computational power on redundant data. You're going to get to the minimum having processed far fewer examples if you take steps as you get useful information.
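
To make the redundancy point concrete, here is a rough sketch (not taken from anyone's actual code) on a toy problem I made up: fit w in y = w * x, where the dataset is just 10 distinct points copied 100 times. The function names, the learning rate of 0.5, the tolerance, and the trick of stopping once w is close to the known true value are all assumptions chosen for illustration. The thing to compare is how many per-example gradients each method computes before it gets there.

```python
import random


def make_redundant_data(copies=100):
    # 10 distinct noise-free points on the line y = 3x, repeated `copies` times.
    base = [(0.5 + 0.1 * i, 3.0 * (0.5 + 0.1 * i)) for i in range(10)]
    return base * copies


def full_batch_gd(data, lr=0.5, tol=0.01, true_w=3.0):
    # Every step averages the gradient over the whole (redundant) dataset.
    w, evals = 0.0, 0
    while abs(w - true_w) >= tol:
        grad = sum((w * x - y) * x for x, y in data) / len(data)
        evals += len(data)  # one per-example gradient for every data point
        w -= lr * grad
    return w, evals


def sgd(data, lr=0.5, tol=0.01, true_w=3.0, seed=0):
    # Every step uses the gradient of a single randomly chosen example.
    rng = random.Random(seed)
    w, evals = 0.0, 0
    while abs(w - true_w) >= tol:
        x, y = rng.choice(data)
        w -= lr * (w * x - y) * x
        evals += 1  # one per-example gradient per step
    return w, evals


if __name__ == "__main__":
    data = make_redundant_data()
    w_batch, evals_batch = full_batch_gd(data)
    w_sgd, evals_sgd = sgd(data)
    print("full batch: w =", w_batch, "per-example gradients =", evals_batch)
    print("SGD:        w =", w_sgd, "per-example gradients =", evals_sgd)
```

This is a best case for SGD, since the data are noise-free and heavily duplicated, so every single-example step moves w toward the answer. With noisy data you would typically need a decaying step size, and the gap would depend on how redundant the data really are.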