|
|
|
|
|
by svalorzen
2674 days ago
|
|
RL is actually quite an umbrella term for a lot of things. There's policy gradient methods, which improve directly on the policy to select better actions, there's value based methods which try to approximate the value function of the problem, and get a policy from that, and there's model based methods which try to learn a model and do some sort of planning/processing in order to get the policy. Using model based methods can allow you to do some pretty fancy stuff while massively reducing the number of data samples you need, but on the other side there's a trade off. Using the model usually tends to require lots of not-very-parallelizable computations, and can be more costly computationally. Very large problems can get out of hand pretty quickly, and there's still a lot of work to do before there is something which can be applied in general quickly and efficiently. |
|
I would say though that from my experience, computational cost is rarely the issue with model-based control, because there are various attacks ranging from model simplification (surrogate models, piecewise-affine multi-models i.e. switching between many simpler local models, etc) to precomputing the optimal control law [1] to embedding the model in silicon. Also, some optimal models/control laws can actually parallelized fairly easily (MLD models are expressed mixed-integer programs which can be solved in performant ways using parallel algorithms, with some provisos). This is a well-trodden space with a tremendous amount of industry-driven research behind it.
Most of these methods come under the Model Predictive Control (MPC) umbrella which has been studied extensively over 3 decades [2]. The paradigm is extremely simple: (1) given a model of how output y responds to input u, predict over the next n time periods the values of u's needed to optimize an objective function. (2) Implement ONLY the first u. (3) Read the sensor value for y (actual y in real world). (4) Update your model with the difference between actual y and predicted y, move the prediction window forward, and repeat (feedback). When this is applied recursively, you obtain approximately optimal control on real-life systems even in the presence of model-reality mismatch, noise and bounded uncertainty.
If you think about it, this is the paradigm behind many planning strategies -- forecast, take a small action, get feedback, try again. The difference though is that MPC is a strategy with a substantial amount of mathematical theory (including stability analysis, reachability, controllability, etc.), software, and industrial practice behind it.
[1] Explicit MPC http://divf.eng.cam.ac.uk/cfes/pub/Main/Presentations/Morari...
[2] https://en.wikipedia.org/wiki/Model_predictive_control