Hacker News new | ask | show | jobs
by d3m0t3p 530 days ago
Well, it's just like stochastic gradient descent, if you think about it. The normal gradient descent is computed using the whole training set. The stochastic gradient is trained on a batch (a subset of the training set), and in the distributed case, we compute two batches at once by doing the gradient on each in parallel. The intuition works IMO, but indeed, having the first batch update and then the second, is not equal to having the mean update.

This is indeed super cool !

1 comments

Does anyone actually use the 'normal gradient descent' with the whole training set? I only ever see it as a sort of straw man to make explanation easier.
Generally yes, vanilla gradient descent gets plenty of use. But for LLMs: no, it’s not really used, and stochastic gradient descent provides a form of regularization, so it probably works better in addition to being more practical.
Full batch with L-BFGS, when possible, is wildly underappreciated.