Hacker News
by jg8610 3374 days ago
Interestingly, there's a nice intuitive explanation for why SGD often beats full-batch GD.

If you compute each gradient step over all of the data, you're spending compute on redundant examples. You'll get close to the minimum after processing far less data if you take a step as soon as you have useful gradient information, instead of waiting for a full pass.
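
To make that concrete, here's a rough sketch on a made-up toy problem (not from any real benchmark): fit y = w * x with true w = 3, where ten distinct points are copied 100 times so most of the dataset is redundant. The names `run_gd` and `run_sgd`, the learning rate, and the tolerance are all my own assumptions, and stopping on the known true w is a shortcut for illustration only. It counts how many per-example gradients each method evaluates before getting within tolerance.

```python
import random

# Hypothetical toy data: 10 distinct noise-free points on y = 3x, copied 100 times.
base = [(k / 10, 3.0 * k / 10) for k in range(1, 11)]
data = base * 100  # 1000 examples, but only 10 carry distinct information

def grad(w, x, y):
    # Gradient of 0.5 * (w*x - y)**2 with respect to w, for a single example.
    return (w * x - y) * x

def run_gd(lr=0.5, tol=1e-3):
    # Full-batch GD: every step averages the gradient over the whole dataset.
    w, evals = 0.0, 0
    while abs(w - 3.0) > tol:  # uses the known answer; illustration only
        g = sum(grad(w, x, y) for x, y in data) / len(data)
        evals += len(data)
        w -= lr * g
    return w, evals

def run_sgd(lr=0.5, tol=1e-3, seed=0):
    # SGD: every step uses the gradient of one randomly chosen example.
    rng = random.Random(seed)
    w, evals = 0.0, 0
    while abs(w - 3.0) > tol:
        x, y = rng.choice(data)
        evals += 1
        w -= lr * grad(w, x, y)
    return w, evals

w_gd, gd_evals = run_gd()
w_sgd, sgd_evals = run_sgd()
print("GD :", gd_evals, "per-example gradient evaluations")
print("SGD:", sgd_evals, "per-example gradient evaluations")
```

On this setup GD pays for all 1000 examples on every step even though 990 of them repeat what it already saw, while SGD pays for one. Caveat: the data here is noise-free, so a fixed step size takes SGD all the way to the minimum. With noisy data it would bounce around the minimum unless the step size is decayed.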