| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by d3m0t3p 577 days ago
	Well, it's just like stochastic gradient descent, if you think about it. The normal gradient descent is computed using the whole training set. The stochastic gradient is trained on a batch (a subset of the training set), and in the distributed case, we compute two batches at once by doing the gradient on each in parallel. The intuition works IMO, but indeed, having the first batch update and then the second, is not equal to having the mean update. This is indeed super cool !

1 comments

eru 577 days ago

Does anyone actually use the 'normal gradient descent' with the whole training set? I only ever see it as a sort of straw man to make explanation easier.

link

jey 577 days ago

Generally yes, vanilla gradient descent gets plenty of use. But for LLMs: no, it’s not really used, and stochastic gradient descent provides a form of regularization, so it probably works better in addition to being more practical.

link

bravura 577 days ago

Full batch with L-BFGS, when possible, is wildly underappreciated.

link