| HN Mirror


by pidtuner 2125 days ago

"The real promise of these methods is to use the universal approximator power of NNs...", still if one is to use a grey-box non-linear model dx/dt = F(x, u, t), why use NNs to characterize F? I would be more comfortable using a polynomial to characterize non-linearity than a "deep" black-box.

Polynomials are much easier to "train" because it is just one linear regression with no iteration. It has also been hinted that NNs are in essence polynomial regressions [0]. Furthermore, most activation functions are based on e^x, where the actual implementation of e^x in a computer is again a polynomial!

[0] https://arxiv.org/abs/1806.06850

3 comments

unishark 2125 days ago

Gradient descent is already about as easy a training method as can be. Just a little freshman calculus and programmers can do the "state of the art" optimization of modern times. It's also scalable. If your polynomial regression gets too large because of the model complexity (for comparison, typical deep networks can have millions of parameters) you can't invert your matrix and probably end up using a similar method anyway.

I would have thought a computer uses tables to compute e^x. There are also piecewise linear activation functions whose gradients are trivially easy to compute.

The whole "universal approximation" perspective is pretty vague to begin with. I'd say generally people don't understand why NNs work as well as they do. Previously theorists expected they would need a lot more training data to work, given their complexity. So it's driven to a large degree by empirical success. I am certainly really interested to see people accomplishing the same things with less sophisticated methods, since there is no doubt NNs have been overused/hyped in some areas just to make the papers and proposals sexier.

srean 2125 days ago

> The whole "universal approximation" perspective is pretty vague to begin with

Multiple times this. This claim gets trotted around frequently to showcase superiority of NNs.

At best this is a red herring; at worst it is dishonest. The problem is they aren't the only universal approximators. There is a whole slew of them: nearest neighbor approximators, polynomials, rational splines, kernel methods … Furthermore, the universal approximation property holds only under certain conditions.

Finally, the ability to represent a function arbitrarily well (approximation property) does not mean that one will be able to find the representation from data easily (learning property). Empirical evidence suggests that among the class of universal approximators we know, NNs seem easy to train effectively. Why this is so is not quite well understood.

pidtuner 2125 days ago

Wavelets, sum of exponentials, Fourier, ... I just mentioned polynomials because they are easiest. But people just jump onto the NN bandwagon to get attention. Truth is that it is just another tool, and a good engineer has to choose the best tool from the toolbox and not just pick the hammer every time.

ChrisRackauckas 2125 days ago

For reference, the DiffEqFlux library has a bunch of classical basis layers [1] and ways to tensor product them [2] for this reason. The real answer as to when to use a neural network is quite complicated [3], but in summary the results all point to the fact that for approximating an R^n -> R^m function, one only needs polynomially many parameters in order to do it well (as proven in a few cases like in that linked paper for "any case where Monte Carlo algorithms are not exponential in dimension"). Tensor products of classical basis functions have to cover every combination of terms (sin(ix)*sin(jy)) so they naturally grow like p^n if you have p parameters in each dimension, so this exponential parameter growth is the curse of dimensionality and this polynomial growth is the formal way of describing how neural networks overcome the curse of dimensionality. So what is useful can depend on a number of factors (another property is the isotropy of the function you're trying to approximate), but this asymptotic property is what makes neural networks a good tool in the high dimensional world where they are commonly used. That makes them quite good as well for things like feedback controllers of larger ODE systems. But yes, in smaller dimensional cases Fourier basis and such are good choices.

    [1] https://diffeqflux.sciml.ai/dev/layers/BasisLayers/
    [2] https://diffeqflux.sciml.ai/dev/layers/TensorLayer/
    [3] https://arxiv.org/abs/1908.10828

pidtuner 2121 days ago

Fitting to sum of N exponentials is also a linear problem with no iterations https://math.stackexchange.com/questions/1428566/fit-sum-of-...

jessaustin 2125 days ago

[1] https://diffeqflux.sciml.ai/dev/layers/BasisLayers/

[2] https://diffeqflux.sciml.ai/dev/layers/TensorLayer/

[3] https://arxiv.org/abs/1908.10828

pidtuner 2125 days ago

Yes! Thank you Sir! I can see you know what you are talking about. This is my point, NNs are very useful for some problems, for others they are not worth the complexity and black-box nature.

pidtuner 2125 days ago

For polynomial regression of the type y = p0 + p1x + p2x^2 + ... + pnx^n, the "training" algorithm is linear least squares (no need of gradient descent). Assuming you have data (y, x), the explicit least squares solution is P = pinv(X) Y, see:

y = [1, x, x^2, ... x^n][p0, p1, p2, ..., pn]^T = XP

XP = y

(X^T X)P = (X^T) y

P = (X^T * X)^-1 * (X^T) * y

(X^T * X)^-1 * (X^T) is called the pseudo-inverse of X (which contains all your data); this form is valid when X has full column rank. No need of iterations. A similar solution is found for multi-variable polynomials, i.e. y = f(x1, x2, ..., xm), where f is a polynomial containing combinations of the independent variables x1, ..., xm and their powers.
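
As a rough illustration of the closed-form fit described above, here is a minimal Python sketch that builds the normal equations (X^T X) P = X^T y and solves them directly. The function name `fit_polynomial` and the sample data are made up for this example, and a small Gaussian elimination stands in for the pseudo-inverse; in practice one would more likely call an SVD-based routine from a linear algebra library.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) P = X^T y."""
    n = degree + 1
    # Design matrix X: one row [1, x, x^2, ..., x^degree] per sample.
    X = [[x ** k for k in range(n)] for x in xs]
    # A = X^T X and b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    # Solve A P = b by Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    P = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * P[j] for j in range(i + 1, n))
        P[i] = (b[i] - s) / A[i][i]
    return P


# Hypothetical noiseless data generated from y = 1 - 2x + 0.5x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 - 2.0 * x + 0.5 * x ** 2 for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)  # close to [1.0, -2.0, 0.5], up to floating-point rounding
```

The whole fit is a single linear solve, which is the "no iterations" point; the cost is dominated by forming and solving the (degree + 1) by (degree + 1) system.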

unishark 2125 days ago

> ... (X^T * X)^-1 ...

This is the matrix inversion I was referring to. Its size (at best) depends on the smaller of the number of parameters and the number of training samples. Both get very big in machine learning. When this happens you need to use some kind of low-memory iterative method like Greville's algorithm or even gradient descent itself. So you're ultimately not any better off.

pidtuner 2125 days ago

In practice one computes (X^T * X)^-1 * (X^T) in one go using Singular Value Decomposition, for which very efficient algorithms exist. But if there is really a lot of data, then recursive linear least squares can be used to partition the larger least squares problem into smaller pieces. But then again, you just make one pass over the data, not multiple passes as with gradient descent.
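
The one-pass idea can be sketched with the textbook recursive least squares update, which processes one sample at a time and never stores the full data matrix. This is only an illustration: the function name, the initial covariance scale `delta`, and the sample data are assumptions made for this example, and the large `delta` means the result is a very lightly regularized least squares estimate rather than the exact one.

```python
def recursive_least_squares(samples, n_params, delta=1e6):
    """One pass of recursive least squares over (features, target) pairs."""
    theta = [0.0] * n_params  # current parameter estimate
    # P starts as a large multiple of the identity (weak prior on theta).
    P = [[delta if i == j else 0.0 for j in range(n_params)] for i in range(n_params)]
    for phi, y in samples:
        # P_phi = P * phi
        P_phi = [sum(P[i][j] * phi[j] for j in range(n_params)) for i in range(n_params)]
        denom = 1.0 + sum(phi[i] * P_phi[i] for i in range(n_params))
        gain = [v / denom for v in P_phi]
        # Prediction error of the current estimate on this sample.
        error = y - sum(phi[i] * theta[i] for i in range(n_params))
        theta = [theta[i] + gain[i] * error for i in range(n_params)]
        # Rank-one update: P <- P - gain * (P * phi)^T  (P is symmetric).
        P = [[P[i][j] - gain[i] * P_phi[j] for j in range(n_params)]
             for i in range(n_params)]
    return theta


# Hypothetical stream of samples from y = 1 - 2x + 0.5x^2 with features [1, x, x^2].
stream = []
for i in range(20):
    x = -2.0 + 0.25 * i
    stream.append(([1.0, x, x * x], 1.0 - 2.0 * x + 0.5 * x * x))

theta = recursive_least_squares(stream, n_params=3)
print(theta)  # close to [1.0, -2.0, 0.5]
```

Memory use depends only on the number of parameters (the n by n matrix P), not on the number of samples, which is why a single pass suffices.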

unishark 2125 days ago

Interestingly (another empirical result that's poorly understood), with stochastic gradient descent, convergence often requires only one pass through the data; if not, it might take a small number of passes.

And yes this field only exists because we are presuming a really large amount of data, which often can't even fit on the same hard drive. And a really complex model.

Older kernel methods basically do what you want for tractable datasets. They can do very high-order polynomials, and also add the ability to regularize the solution various ways. Though again, I would be interested in seeing those methods compared to a simple least-squares fit as you propose, which people often didn't do even back when kernel methods were all the rage.

freemint 2125 days ago

Polynomials (or rather multinomials) suffer badly from the curse of dimensionality when more terms are needed (look at how Taylor series terms explode). Neural networks do better. The fact that a neural network is computed using polynomials is irrelevant, since the way the NN is parametrized is different from a sum over a polynomial basis. You can inspect the vector field to prove certain properties of the neural network. SINDy is already mentioned in another reply.
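
To make the term explosion concrete: the number of monomials of total degree at most d in n variables is the binomial coefficient C(n + d, d). The sketch below simply evaluates that formula; the function name `count_terms` and the chosen values of n and d are illustrative only.

```python
import math


def count_terms(n_vars, max_degree):
    """Number of monomials of total degree <= max_degree in n_vars variables."""
    return math.comb(n_vars + max_degree, max_degree)


# Two variables, degree 2: 1, x, y, x^2, xy, y^2.
print(count_terms(2, 2))  # 6

# How the coefficient count grows with dimension for a few fixed degrees.
for n_vars in (2, 5, 10, 20):
    print(n_vars, [count_terms(n_vars, d) for d in (2, 3, 5)])
```

Each of these monomials needs its own coefficient in a multivariate polynomial fit, so the size of the regression problem grows quickly with both the dimension and the degree.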

Libbum 2125 days ago

Yeah, this is the crux. Here's a comment from one of the devs when I asked about the polynomial vs NN basis:

The answer is quite simple really. Classical basis functions suffer from the curse of dimensionality because if you tensor product polynomial basis functions or things like Fourier basis, with N basis functions in each direction, then you have N^d parameters that are required in order to handle every combination `sin(x) + sin(2x) + ... + sin(y) + sin(2y) + ... + sin(x)sin(y) + sin(2x)sin(y) + ....`

Neural networks only grow polynomially with dimension, so at around 8 dimensional objects it becomes more efficient. In fact, this is why we have https://diffeqflux.sciml.ai/dev/layers/BasisLayers/
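
To put rough numbers on the growth rates quoted above, the sketch below counts the N^d coefficients of a tensor-product basis and, for comparison, the weights and biases of a single-hidden-layer network. The hidden width of 64 is an arbitrary assumption for illustration; it says nothing about whether such a network approximates a given function equally well, and the numbers are not meant to reproduce the "around 8 dimensions" crossover mentioned in the quote.

```python
def tensor_product_params(n_basis, dim):
    """Coefficients needed for a full tensor product of n_basis functions per dimension."""
    return n_basis ** dim


def mlp_params(dim, hidden, out=1):
    """Weights plus biases of a dense network with one hidden layer."""
    return hidden * (dim + 1) + out * (hidden + 1)


# Hypothetical setup: 10 basis functions per dimension vs. a width-64 hidden layer.
for dim in (1, 2, 4, 8, 16):
    print(dim, tensor_product_params(10, dim), mlp_params(dim, hidden=64))
```

The tensor-product count is exponential in the dimension, while the dense layer count is linear in it for a fixed width; how wide the network has to be for a given accuracy is the part the linked paper addresses.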

pidtuner 2125 days ago

Polynomials are just an example, the easiest one. The point is that there are many more universal approximators (as some other user commented here), many of them much more suitable for control applications than NNs.

Libbum 2125 days ago

I'm not entirely confident in answering that directly, so perhaps you can check my reasoning here.

If F is completely unknown, perhaps you start training with a 10 dimensional polynomial basis. What is the (computational) cost of obtaining your solution? Once you have it, will this polynomial accurately represent your system in any real world manner? Perhaps higher order parameters are needed to approximate trigonometric functions - are you able to easily add such functions to your training basis? If not - then your basis could be too restrictive to provide you with a minimal implementation of your control variable.

It looks like you work with this stuff far more than I have, so perhaps that's not an adequate answer.

Another way to look at this though: If you only wanted to characterise your system with polynomials, UODEs + SINDy can do this for you - the NN is simply the optimisation method that's in place of any other optimisation algorithm.

pidtuner 2125 days ago

The computational cost of "training" a polynomial would be the same as just one iteration of the training algorithm used by typical NNs. When it comes to trig functions, the story is the same as with the exponential function e^x. When you call the sin() or the cos() functions in your favorite language, in the end it uses a Taylor series (polynomials) to compute it (plus some hacks to add precision on certain ranges of the function and to overcome some floating point precision limitations).
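
A minimal sketch of the "polynomial plus range hacks" idea for sin(): reduce the argument to [-pi, pi] using periodicity, then evaluate a truncated Taylor polynomial. The function name `sin_taylor` and the number of terms are choices made for this example. Production math libraries typically use more careful range reduction and minimax-style polynomial coefficients rather than raw Taylor coefficients, so this is an illustration of the principle, not of any particular libm.

```python
import math


def sin_taylor(x, terms=12):
    """Approximate sin(x) with range reduction and a truncated Taylor series."""
    # Range reduction: sin is 2*pi periodic, so map x into [-pi, pi].
    x = math.remainder(x, 2.0 * math.pi)
    term = x   # first term of the series: x
    total = x
    for k in range(1, terms):
        # Next term: multiply by -x^2 / ((2k) * (2k + 1)).
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total


for value in (0.5, 2.0, 10.0, -7.3):
    print(value, sin_taylor(value), math.sin(value))
```

Without the range reduction step the same polynomial would need far more terms for large arguments, which is one reason real implementations treat it as essential.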

The degree at which a polynomial model would fit the real world system has to be validated against data, just the same as with NNs. What does one do when an NN fit is not good enough or too good (overfitting)? One adds or removes layers. Same with polynomials, one increases or decreases degrees.

Sorry for the rant. I am not saying NNs are useless, because I do believe they are super useful for certain problems, especially for categorization. But it seems to me that nowadays there is this trend of using NNs as a hammer, and not all problems are nails. Especially when it comes to control, where lives or big economic losses are at stake, it is the responsibility of the engineer to resist the fuss and craze and use the right tool for the problem.

ChrisRackauckas 2125 days ago

There is a saying in mathematics that the fastest way to a solution is through the complex plane. This was discovered because a lot of proofs become nicer when one does analytic continuation and analyzes the properties of the continued function. Complex-step differentiation is another example of this.
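
For readers who have not seen it, complex-step differentiation estimates f'(x) as Im(f(x + ih)) / h for a tiny step h. Because no subtraction of nearly equal numbers is involved, h can be made extremely small without the cancellation error of finite differences. The sketch below is a minimal illustration; the function name, the step size, and the test function are assumptions for this example, and the method only applies when f can be evaluated analytically on complex inputs.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Estimate f'(x) from the imaginary part of f evaluated at x + i*h."""
    return f(complex(x, h)).imag / h


def f(z):
    # Hypothetical test function: f(z) = exp(z) * sin(z).
    return cmath.exp(z) * cmath.sin(z)


x0 = 1.0
estimate = complex_step_derivative(f, x0)
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0))
print(estimate, exact)
```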

In some sense, something similar applies to neural networks in this context. Have you done a lot of fitting of classical basis methods inside of differential equations? They are very prone to local minima, so direct training of polynomials inside of a differential equation is rather hard. Neural networks somehow avoid this, through magic related to results like [1], which essentially state that local minima are the global minima on large enough neural networks. So this lets you get pretty lazy and just do local optimization to find missing functions, and then sparsify to polynomials later, in a way where the optimization is better behaved than going directly to polynomials. The DiffEqFlux library has both approaches available, so you can try both side by side and see the difference. From years of experience doing the former, the latter is quite a breath of fresh air.

   [1] https://arxiv.org/abs/1412.0233

freemint 2125 days ago

> The computational cost of "training" a polynomial would be the same as just one iteration of the training algorithm used by typical NNs.

That statement depends heavily on the dimensionality of the problem. Polynomials also have huge problems with discontinuities (even in some higher-order derivative), and sometimes would require an infinite number of polynomial terms to smooth out the errors around the discontinuities (try to fit the integral of |x| with polynomials).

Fear of NN in control is justified if the networks are poorly understood.

srean 2125 days ago

Not just that, they tend to blow up when one extrapolates 'too' far from data. This can be controlled for using other basis functions, for example functions in a reproducing kernel Hilbert space or radial basis functions. It is best to choose the basis based upon the data (as RBFs and RKHS bases do) and not choose a basis independent of the data. This applies to polynomials too: choosing a polynomial basis that's orthogonal with respect to the data distribution makes computations much better behaved -- otherwise it's common to run into ill-conditioned problems that are very sensitive to noise in the data.

Libbum 2125 days ago

I certainly agree with the "NNs are used as hammers" point. Until coming across the UODE concept I was of the opinion they were more parlour trick than anything useful. Here though, I could see some validity.

These comments are appreciated - I think a discussion like this is lacking in the SciML docs (or at least not visible enough). Will have a chat with some of the devs and see if there's something we can add.