Hacker News
by thorel 1440 days ago
Regarding stochastic gradient descent, I think there has been a growing understanding in recent years that the randomness introduced by random sampling/batching not only reduces the computational cost (compared to computing the full gradient) but also adds noise that helps the iterates escape local minima. Some variants of stochastic gradient descent in fact inject additional random noise to amplify this latter effect, and some theoretical guarantees have started to emerge.
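
To make the idea concrete, here is a rough toy sketch of gradient descent with noise injected into each gradient. Everything in it is my own assumption for illustration: the 1-D loss, the names `grad` and `noisy_descent`, the step size, and the use of Gaussian noise as a stand-in for minibatch noise. It is not taken from any particular paper or library.

```python
import random


def grad(x):
    # Gradient of a hypothetical toy loss f(x) = x**4 - 2*x**2 + 0.3*x.
    # This loss has a shallow minimum near x = 0.96 and a deeper one near x = -1.04.
    return 4 * x**3 - 4 * x + 0.3


def noisy_descent(x0, lr=0.01, noise_scale=0.0, steps=2000, seed=0):
    # Gradient descent where each gradient is perturbed by Gaussian noise.
    # noise_scale = 0.0 gives plain deterministic gradient descent.
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        g = grad(x) + noise_scale * rng.gauss(0.0, 1.0)
        x -= lr * g
    return x


if __name__ == "__main__":
    # Without noise, a start at x = 1.0 settles in the shallow minimum.
    print("plain:", noisy_descent(1.0))
    # With noise, some runs may hop over the barrier toward the deeper
    # minimum; whether a given seed does so is not guaranteed.
    for seed in range(5):
        print("noisy, seed", seed, ":", noisy_descent(1.0, noise_scale=5.0, seed=seed))
```

The noiseless run stays in whichever basin it starts in, while the noisy runs at least have a chance of crossing the barrier. How often that happens depends heavily on the step size and noise scale, which I picked arbitrarily here.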