Hacker News new | ask | show | jobs
by unishark 2123 days ago
Gradient descent is already about as easy a training method as can be. Just a little freshman calculus and programmers can do the "state of the art" optimization of modern times. It's also scalable. If your polynomial regression gets too large because of the model complexity (for comparison, typical deep networks can have millions of parameters) you can't invert your matrix and probably end up using a similar method anyway.

I would have thought a computer uses tables to compute e^x. There's also piecewise linear activation functions that are trivially easy to compute gradients of.

The whole "universal approximation" perspective is pretty vague to begin with. I'd say generally people don't understand why NN's work as well as they do. Previously theorists expected they would need a lot more training data to work, given their complexity. So it's driven to a large degree by empirical success. I am certainly really interested to see people accomplishing the same things with less sophisticated methods, since there is no doubt it has been overused/hyped in some areas just to make the papers and proposals sexier.

2 comments

> The whole "universal approximation" perspective is pretty vague to begin with

Multiple times this. This claim gets trotted around frequently to showcase superiority of NNs.

At best this is a red herring at worst it is dishonest. The problem is they aren't the only universal approximators. There is a whole slew of them, nearest neighbor approximators, polynomials, rational splines, kernel methods … Furthermore the universal approximation property holds under conditions.

Finally, the ability to represent a function arbitrarily well (approximation property) does not mean that one will be able to find the representation from data easily (learning property). Empirical evidence suggests that among the class of universal approximators we know, NNs seems easy to train effectively. Why this is so s not quite well understood.

wavelets, sum of exponentials, fourier, ... I just mentioned polynomials because they are easiest. But people just jump into the NN bandwagon to get attention. Truth is that is just another tool, and a good engineer has to choose the best tool form the toolbox and not just pick the hammer everytime.
For reference, the DiffEqFlux library has a bunch of classical basis layers [1] and ways to tensor product them [2] for this reason. The real answer as to when to use a neural network is quite complicated [3], but in summary the results all point to the fact that for approximating an R^n -> R^m function, one only needs polynomially many parameters in order to do it well (as proven in a few cases like in that linked paper for "any case where Monte Carlo algorithms are not exponential in dimension"). Tensor products of classical basis functions have to cover every combination of terms (sin(ix)*sin(jy)) so they naturally grow like p^n if you have p parameters in each dimension, so this exponential parameter growth is the curse of dimensionality and this polynomial growth is the formal way of describing how neural networks overcome the curse of dimensionality. So what is useful can depend on a number of factors (another property is the isotropy of the function you're trying to approximate), but this asymptotic property is what makes neural networks a good tool in the high dimensional world where they are commonly used. That makes them quite good as well for things like feedback controllers of larger ODE systems. But yes, in smaller dimensional cases Fourier basis and such are good choices.

    [1] https://diffeqflux.sciml.ai/dev/layers/BasisLayers/
    [2] https://diffeqflux.sciml.ai/dev/layers/TensorLayer/
    [3] https://arxiv.org/abs/1908.10828
Fitting to sum of N exponentials is also a linear problem with no iterations https://math.stackexchange.com/questions/1428566/fit-sum-of-...
Yes! Thank you Sir! I can see you know what you are talking about. This is my point, NNs are very useful for some problems, for others they are not worth the complexity and black-box nature.
For polynomial regression of the type y = p0 + p1x + p2x^2 + ... + pnx^n, the "training" algorithm is linear least squares (no need of gradient descent). Assuming you have data (y, x), the explicit least squares solution is P = pinv(X) Y, see:

y = [1, x, x^2, ... x^n][p0, p1, p2, ..., pn]^T = XP

XP = y

(X^T X)P = (X^T) y

P = (X^T * X)^-1 * (X^T) * y

(X^T * X)^-1 * (X^T) is called the pseudo-inverse of X (which contains all your data). No need of iterations. A similar solution is found for multi-variable polynomials i.e. y = f(x1, x2, ..., xm) where f is a polynomial containing combinations of the independent variables xm and their powers.

> ... (X^T * X)^-1 ...

This is the matrix inversion I was referring to. It's size (at best) depends on the smaller of the number of parameters and the amount of training samples. Both get very big in machine learning. When this happens you need to use some kind of low-memory iterative method like Greville's algorithm or even gradient descent itself. So you're ultimately not any better off.

In practice one computes (X^T * X)^-1 * (X^T) in one go using Singular Value Decomposition, for which very efficient algorithms exists. But if there is really a lot of data, then recursive linear least squares can be used, to partition the larger least squares into smaller pieces. But then again, you just make one pass on the data, not multiple passes, like with gradient descent.
Interestingly (another empirical result that's poorly understood), with stochastic gradient descent, convergence often only requires one pass through the data, if not it might take a small number.

And yes this field only exists because we are presuming a really large amount of data, which often can't even fit on the same hard drive. And a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can do very high-order polynomials, and also add the ability to regularize the solution various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.