Hacker News new | ask | show | jobs
by unishark 2123 days ago
Interestingly (another empirical result that's poorly understood), with stochastic gradient descent, convergence often only requires one pass through the data, if not it might take a small number.

And yes this field only exists because we are presuming a really large amount of data, which often can't even fit on the same hard drive. And a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can do very high-order polynomials, and also add the ability to regularize the solution various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.