
Interestingly (and this is another empirical result that's poorly understood), with stochastic gradient descent, convergence often requires only one pass through the data. When one pass isn't enough, it usually takes only a small number of passes.
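To make the "one pass" point concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the synthetic linear-regression data, the sizes, the learning rate, and the helper name `sgd_one_pass` are all hypothetical. It is a convex problem where one pass is easy to justify, so it doesn't explain the empirical result for large, complex models; it only shows what a single pass looks like.

```python
import numpy as np

def sgd_one_pass(X, y, lr=0.01):
    """Plain SGD on squared error for a linear model, visiting each sample once."""
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        grad = (x_i @ w - y_i) * x_i  # gradient of 0.5 * (x_i . w - y_i)^2
        w -= lr * grad
    return w

# Synthetic data (assumed setup): y = X w_true + small Gaussian noise.
rng = np.random.default_rng(0)
n, d = 20_000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_sgd = sgd_one_pass(X, y)                     # exactly one pass over the data
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # batch least squares for reference

print("SGD, one pass, distance to w_true:", np.linalg.norm(w_sgd - w_true))
print("Least squares, distance to w_true:", np.linalg.norm(w_ls - w_true))
```

With a fixed learning rate the SGD estimate keeps jittering around the solution, so it should land close to the least-squares answer without matching it exactly.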

And yes, this field only exists because we are presuming a really large amount of data, often too much to fit on a single hard drive, and a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can do very high-order polynomials, and they also add the ability to regularize the solution in various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which is a comparison people often didn't make even back when kernel methods were all the rage.
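For what it's worth, the comparison is cheap to set up on a small dataset. The sketch below assumes the "simple least-squares fit" means an ordinary polynomial fit, and it uses kernel ridge regression with a polynomial kernel as the kernel method. The target function, degree, noise level, and regularization strength `lam` are all made-up values, so this shows the mechanics rather than a real result.

```python
import numpy as np

def poly_kernel(a, b, degree):
    """Polynomial kernel k(a, b) = (1 + a*b)^degree for 1-D inputs."""
    return (1.0 + np.outer(a, b)) ** degree

# Synthetic 1-D regression problem (assumed setup).
rng = np.random.default_rng(0)
degree, lam = 5, 1e-3
x_train = rng.uniform(-1, 1, size=200)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=200)
x_test = np.linspace(-1, 1, 101)
y_test = np.sin(3 * x_test)  # noiseless target, used only for scoring

# Kernel ridge regression: solve (K + lam * I) alpha = y, predict with k(x, X) @ alpha.
K = poly_kernel(x_train, x_train, degree)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
y_kernel = poly_kernel(x_test, x_train, degree) @ alpha

# Plain least-squares polynomial fit of the same degree, no regularization.
coeffs = np.polyfit(x_train, y_train, degree)
y_ls = np.polyval(coeffs, x_test)

mse_kernel = np.mean((y_kernel - y_test) ** 2)
mse_ls = np.mean((y_ls - y_test) ** 2)
print("kernel ridge test MSE: ", mse_kernel)
print("least squares test MSE:", mse_ls)
```

Both models here span the same degree-5 polynomials, so the only real difference is the ridge penalty. Sweeping `lam` and `degree` is where the two would start to diverge.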