Hacker News new | ask | show | jobs
by atmosfir 2117 days ago
Isn't this just optimization of control parameters?

What is the comparison with existing control engineering methods like PID tuning, MPC and optimal control?

2 comments

I think you are right, in the case of the trebuchet, it just computes a black box approximation of the inverse of the system. The difference being that an analytic inversion would solve the problem for all wind and target conditions, while the NN solution will only work for for value ranges used to obtain the data that trained the NN.

In the case of the inverted pendulum, again the disadvantage of using NN is the black-bock nature of the control algorithm. As a control engineer this gives me the chills, because using black-box algorithms tell you nothing about the robustness of the closed loop system. With model-based control, at least we have strong mathematical tools to guarantee that the closed loop will be robust enough to handle variations outside the data that we used for training (our model). With black-box algorithms like NN you have no guarantees. I would not get into a plane controlled by a NN for sure, look what happened with the 737MAX when software engineers thought they could solve dynamical system problems.

To respond to both the parent question, and this comment: indeed, this is black-box optimal control in essence.

However, this method is just one small aspect of the SciML [0] ecosystem now. The article is a little outdated in that sense.

Once obtaining your NN control parameter, it's now possible to use Sparse Identification of Nonlinear Dynamics (SINDy) on that parameter to recover equations of motion governing it [1].

The real promise of these methods is to use the universal approximator power of NNs to get around the 'curse of dimensionality' & uncover presently unknown representations of motion within any system. Take a look at [2] for a more detailed description.

[0]: https://sciml.ai/ [1]: https://datadriven.sciml.ai/dev/sparse_identification/sindy/ [2]: https://arxiv.org/abs/2001.04385

"The real promise of these methods is to use the universal approximator power of NNs...", still if one is to use a grey-box non-linear model dx/dt = F(x, u, t), why use NNs to characterize F? I would be more comfortable using a polynomial to characterize non-linearity than a "deep" black-box.

Polynomials are much easier to "train" because it is just one linear regression with no iteration. It has also been hinted that NN are in essence polynomial regressions [0]. Furthermore, most activation functions are base on e^x where the actual implementation of e^x in a computer is again a polynomial!

[0] https://arxiv.org/abs/1806.06850

Gradient descent is already about as easy a training method as can be. Just a little freshman calculus and programmers can do the "state of the art" optimization of modern times. It's also scalable. If your polynomial regression gets too large because of the model complexity (for comparison, typical deep networks can have millions of parameters) you can't invert your matrix and probably end up using a similar method anyway.

I would have thought a computer uses tables to compute e^x. There's also piecewise linear activation functions that are trivially easy to compute gradients of.

The whole "universal approximation" perspective is pretty vague to begin with. I'd say generally people don't understand why NN's work as well as they do. Previously theorists expected they would need a lot more training data to work, given their complexity. So it's driven to a large degree by empirical success. I am certainly really interested to see people accomplishing the same things with less sophisticated methods, since there is no doubt it has been overused/hyped in some areas just to make the papers and proposals sexier.

> The whole "universal approximation" perspective is pretty vague to begin with

Multiple times this. This claim gets trotted around frequently to showcase superiority of NNs.

At best this is a red herring at worst it is dishonest. The problem is they aren't the only universal approximators. There is a whole slew of them, nearest neighbor approximators, polynomials, rational splines, kernel methods … Furthermore the universal approximation property holds under conditions.

Finally, the ability to represent a function arbitrarily well (approximation property) does not mean that one will be able to find the representation from data easily (learning property). Empirical evidence suggests that among the class of universal approximators we know, NNs seems easy to train effectively. Why this is so s not quite well understood.

wavelets, sum of exponentials, fourier, ... I just mentioned polynomials because they are easiest. But people just jump into the NN bandwagon to get attention. Truth is that is just another tool, and a good engineer has to choose the best tool form the toolbox and not just pick the hammer everytime.
For polynomial regression of the type y = p0 + p1x + p2x^2 + ... + pnx^n, the "training" algorithm is linear least squares (no need of gradient descent). Assuming you have data (y, x), the explicit least squares solution is P = pinv(X) Y, see:

y = [1, x, x^2, ... x^n][p0, p1, p2, ..., pn]^T = XP

XP = y

(X^T X)P = (X^T) y

P = (X^T * X)^-1 * (X^T) * y

(X^T * X)^-1 * (X^T) is called the pseudo-inverse of X (which contains all your data). No need of iterations. A similar solution is found for multi-variable polynomials i.e. y = f(x1, x2, ..., xm) where f is a polynomial containing combinations of the independent variables xm and their powers.

> ... (X^T * X)^-1 ...

This is the matrix inversion I was referring to. It's size (at best) depends on the smaller of the number of parameters and the amount of training samples. Both get very big in machine learning. When this happens you need to use some kind of low-memory iterative method like Greville's algorithm or even gradient descent itself. So you're ultimately not any better off.

Polynomials (or rather multinomials) suffer from the curse of dimensionality badly when needing more terms (look how taylor series terms explode). Neural networks do better. The fact that a neural network is computed using polynomials is irrelevant since the way the NN is parametrized is different from a sum of a basis of polynomials. You can inspect the vector field to proof certain properties of the neural network. SINDy is already mentioned in another reply.
Yeah, this is the crux. Here's a comment from one of the devs when I asked about the polynomial vs NN basis:

The answer is quite simple really. Classical basis functions suffer from the curse of dimensionality because if you tensor product polynomial basis functions or things like Fourier basis, with N basis functions in each direction, then you have N^d parameters that are required in order to handle every combination `sin(x) + sin(2x) + ... + sin(y) + sin(2y) + ... + sin(x)sin(y) + sin(2x)sin(y) + ....`

Neural networks only grow polynomially with dimensional, so at around 8 dimensional objects it becomes more efficient. In fact, this is why we have https://diffeqflux.sciml.ai/dev/layers/BasisLayers/

Polynomials are just an example, the easiest one. The point is that there are many more universal approximators (as some other user commented here), many of them much more suitable for control applications than NNs.
I'm not entirely confident in answering that directly, so perhaps you can check my reasoning here.

If F is completely unknown, perhaps you start training with a 10 dimensional polynomial basis. What is the (computational) cost of obtaining your solution? Once you have it, will this polynomial accurately represent your system in any real world manner? Perhaps higher order parameters are needed to approximate trigonometric functions - are you able to easily add such functions to your training basis? If not - then your basis could be too restrictive to provide you with a minimal implementation of your control variable.

It looks like you work with this stuff far more than I have, so perhaps that's not an adequate answer.

Another way to look at this though: If you only wanted to characterise your system with polynomials, UODEs + SINDy can do this for you - the NN is simply the optimisation method that's in place of any other optimisation algorithm.

The computational cost of "training" a polynomial would be the same as just one iteration of the training algorithm used by typical NNs. When it comes to trig functions, the story is the same as with the exp function e(). When you call the sin() or the cos() functions in your favorite language, in the end it uses taylor series (polynomials) to compute it (plus some hacks to add precision on certain ranges of the function and to overcome some floating point precision limitations).

The degree at which a polynomial model would fit the real world system has to be validated against data, just the same as with NNs. What does one do when an NN fit is not good enough or too good (overfitting)? One adds or removes layers. Same with polynomials, one increases or decreases degrees.

Sorry for the rant, I am not saying NNs are useless, because I do believe they are super useful for certain problems, specially for categorization. But it seems to me that now a days there is this trend of using NNs as a hammer, and not all problems are nails. Specially when it comes to control, and lives or big economic losses are at stake, it is the responsibility of the engineer to resist the fuzz and craze and use the right tool for the problem.

There is a saying in mathematics that the fastest way to a solution is through the complex plane. This was discovered because a lot of proofs are nicer by doing analytical continuation and analyzing the properties of the continued function. Complex-step differentiation is another example of this.

In some sense, something similar applies to neural networks in this context. Have you done a lot of fitting of classical basis methods inside of differential equations? They are very prone to local minima, so direct training of polynomials inside of a differential equation is rather hard. But through neural network magic, somehow related to [1], which essentially state that local minima are the global minima on large enough neural networks. So this lets you get pretty lazy and just do local optimization to find missing functions, and then sparsify to polynomials later, in a way where the optimization is better behaved than going directly to polynomials. The DiffEqFlux library has both approaches available, so you can try both side by side and see the difference. From years of experience doing the former, the latter is quite a breath of fresh air.

   [1] https://arxiv.org/abs/1412.0233
> The computational cost of "training" a polynomial would be the same as just one iteration of the training algorithm used by typical NNs.

That statement depends heavily on the dimensionality of the problem. Polynomials also have huge problems with discontinuities (even in some higher order derivative) sometimes would require an infinite number of polynomials to smooth out the errors around the discontinuities. (try to fit the Integral of |x| with polynomials)

Fear of NN in control is justified if the networks are poorly understood.

I certainly agree with the NNs are used as hammers point. Until coming across the UODE concept I was of the opinion they were more parlour trick than anything useful. Here though, I could see some validity.

These comments are appreciated - I think a discussion like this is lacking in the SciML docs (or at least not visible enough). Will have a chat with some of the devs and see if there's something we can add.

I assume you are talking about the pendulum. PID tuning is fundamentally restricted to holding set points and can't deal with non-linearites to reach those set points.

MPC if solved over a to small time horizon might fail to find a windup trajectory. But that isn't a restriction. However i would consider MPC overblown for this, as MPC is usually a series of optimal control problems.

By optimal control i assume you mean things approaches where you optimize a u() so it minimizes some integral. This approach does just that, only that u is not parametrized by polynomials/splines but by a neural network, taking state, time and other things into account.