by pidtuner 2123 days ago

For polynomial regression of the form y = p0 + p1 x + p2 x^2 + ... + pn x^n, the "training" algorithm is linear least squares; there is no need for gradient descent. Given data (x, y), the explicit least-squares solution is P = pinv(X) y. The derivation:

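For a single sample:

y = [1, x, x^2, ..., x^n] [p0, p1, p2, ..., pn]^T

Stacking one such row per data point gives the (usually overdetermined) system

X P = y

Multiplying both sides by X^T gives the normal equations

(X^T X) P = X^T y

and, when X^T X is invertible,

P = (X^T X)^-1 X^T y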

(X^T X)^-1 X^T is the pseudo-inverse of X (the matrix that contains all your data) when X has full column rank. No iterations are needed. A similar solution exists for multivariable polynomials, i.e. y = f(x1, x2, ..., xm), where f is a polynomial containing combinations of the independent variables x1, ..., xm and their powers.
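As a rough illustration of the closed-form fit described above, here is a minimal NumPy sketch. The helper name `fit_polynomial` and the sample data are made up for this example; `np.vander` builds the matrix whose rows are [1, x, x^2, ..., x^n], and `np.linalg.pinv` computes the pseudo-inverse.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [p0, ..., pn] for y ~ p0 + p1 x + ... + pn x^n."""
    # Design matrix X: one row [1, x, x^2, ..., x^degree] per sample.
    X = np.vander(x, N=degree + 1, increasing=True)
    # pinv is SVD-based; it matches (X^T X)^-1 X^T when X has full column rank.
    return np.linalg.pinv(X) @ y

# Made-up data for illustration: noise-free samples of y = 1 + 2x + 3x^2.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2

P = fit_polynomial(x, y, degree=2)

# Same answer from the normal equations (X^T X) P = X^T y.
X = np.vander(x, N=3, increasing=True)
P_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

On this toy data both routes should recover the coefficients 1, 2, 3 up to floating-point error. For high degrees X^T X tends to be badly conditioned, so the pseudo-inverse route is usually the safer of the two.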


unishark 2123 days ago

> ... (X^T * X)^-1 ...

This is the matrix inversion I was referring to. Its size (at best) depends on the smaller of the number of parameters and the number of training samples. Both get very big in machine learning. When this happens, you need to use some kind of low-memory iterative method like Greville's algorithm or even gradient descent itself. So you're ultimately not any better off.
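One way to read the "smaller of" remark: with fewer samples than parameters and X of full row rank, the pseudo-inverse can be written as X^T (X X^T)^-1, so the matrix being inverted is samples by samples instead of parameters by parameters. A small sketch, with shapes and random data chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 5 samples, 50 parameters (more parameters than samples).
n_samples, n_params = 5, 50
X = rng.standard_normal((n_samples, n_params))

# Full row rank case: pinv(X) = X^T (X X^T)^-1, so only a 5x5 matrix is inverted.
small_gram = X @ X.T                      # shape (5, 5)
pinv_via_small = X.T @ np.linalg.inv(small_gram)

# Reference value from NumPy's SVD-based pseudo-inverse.
pinv_reference = np.linalg.pinv(X)
```

In this regime X^T X is 50x50 and singular, so the (X^T X)^-1 formula does not apply directly and the pseudo-inverse returns the minimum-norm solution.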


pidtuner 2123 days ago

In practice one computes (X^T X)^-1 X^T in one go using the Singular Value Decomposition, for which very efficient algorithms exist. But if there is really a lot of data, recursive linear least squares can be used to partition the large least-squares problem into smaller pieces. Even then, you make just one pass over the data, not multiple passes as with gradient descent.
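A minimal sketch of the one-pass idea, using the textbook recursive least squares update applied one sample at a time. The function name, the `init_scale` parameter, and the data are assumptions for illustration; a large `init_scale` makes the result approximate the batch solution, at the cost of a very small ridge-like bias.

```python
import numpy as np

def recursive_least_squares(rows, targets, init_scale=1e6):
    """One pass over (row, target) pairs, updating the estimate after each sample."""
    n_params = rows.shape[1]
    P = np.zeros(n_params)                 # current coefficient estimate
    C = init_scale * np.eye(n_params)      # running approximation of (X^T X)^-1
    for phi, y in zip(rows, targets):
        C_phi = C @ phi
        gain = C_phi / (1.0 + phi @ C_phi)
        P = P + gain * (y - phi @ P)       # correct the estimate by the prediction error
        C = C - np.outer(gain, C_phi)      # rank-one update; no matrix inversion
    return P

# Made-up data: noise-free samples of y = 1 + 2x + 3x^2.
x = np.linspace(-1.0, 1.0, 200)
X = np.vander(x, N=3, increasing=True)
y = 1.0 + 2.0 * x + 3.0 * x**2

P_rls = recursive_least_squares(X, y)
P_batch = np.linalg.pinv(X) @ y            # batch solution for comparison
```

Only the n-by-n matrix C and the coefficient vector are kept in memory, so the rows can be streamed from disk and discarded after use.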


unishark 2123 days ago

Interestingly (another empirical result that's poorly understood), with stochastic gradient descent, convergence often requires only one pass through the data; if not, it might take a small number of passes.
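For comparison, a toy sketch of single-pass stochastic gradient descent on the same kind of polynomial fit. The data, the learning rate of 0.1, and the function name are assumptions for illustration. This is a small convex problem with noise-free targets, so it says nothing about the general empirical claim; it only shows what "one pass" means mechanically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data stream: 5000 noise-free samples of y = 1 + 2x + 3x^2.
x = rng.uniform(-1.0, 1.0, size=5000)
X = np.vander(x, N=3, increasing=True)
y = 1.0 + 2.0 * x + 3.0 * x**2

def sgd_one_pass(X, y, learning_rate=0.1):
    """Single pass of stochastic gradient descent on squared error, one sample per step."""
    P = np.zeros(X.shape[1])
    for phi, target in zip(X, y):
        error = phi @ P - target
        P -= learning_rate * error * phi   # gradient of 0.5 * error**2 with respect to P
    return P

P_sgd = sgd_one_pass(X, y)
mse_before = np.mean(y**2)                 # error of the all-zero starting point
mse_after = np.mean((X @ P_sgd - y) ** 2)
```

Each sample is touched exactly once and no matrix is stored or inverted, which is the memory trade-off being discussed.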

And yes, this field only exists because we are presuming a really large amount of data, which often can't even fit on a single hard drive. And a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can do very high-order polynomials, and also add the ability to regularize the solution in various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.
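One common kernel method of this kind is kernel ridge regression with a polynomial kernel. A minimal sketch for 1-D inputs follows; the function names, the kernel form (1 + a*b)^degree, and the ridge value are choices made for this example, not something specified in the thread.

```python
import numpy as np

def polynomial_kernel(a, b, degree):
    """Kernel matrix with entries (1 + a_i * b_j)^degree for 1-D inputs."""
    return (1.0 + np.outer(a, b)) ** degree

def kernel_ridge_fit(x_train, y_train, degree, ridge):
    """Solve (K + ridge * I) alpha = y; K is n_samples x n_samples."""
    K = polynomial_kernel(x_train, x_train, degree)
    return np.linalg.solve(K + ridge * np.eye(len(x_train)), y_train)

def kernel_ridge_predict(x_new, x_train, alpha, degree):
    """Prediction is a kernel-weighted sum over the training points."""
    return polynomial_kernel(x_new, x_train, degree) @ alpha

# Made-up data: noise-free samples of y = 1 + 2x + 3x^2.
x_train = np.linspace(-1.0, 1.0, 30)
y_train = 1.0 + 2.0 * x_train + 3.0 * x_train**2
alpha = kernel_ridge_fit(x_train, y_train, degree=2, ridge=1e-6)

x_new = np.array([-0.5, 0.0, 0.5])
y_pred = kernel_ridge_predict(x_new, x_train, alpha, degree=2)
y_true = 1.0 + 2.0 * x_new + 3.0 * x_new**2
```

The linear system here has one unknown per training sample regardless of the polynomial degree, which is why the approach stays tractable for high orders but not for very large datasets. With a tiny ridge term the predictions should be close to those of a plain degree-2 least-squares fit on this data.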
