by wenc 2680 days ago

My understanding is that RL is a reasonable attack for situations where the environment is (1) mathematically uncharacterized, (2) insufficiently characterized, or (3) characterized, but the resulting model is too complex to use. RL therefore simultaneously explores the environment in simple ways and takes actions to maximize some objective function.

However, there are many environments (chemical/power plants, machines, etc.) where there are good mathematical/empirical data-based models, where model-based optimal control works extremely well in practice (much better than RL).

I'm wondering why the ML community has elected to skip over this latter class of problems, with its large swaths of proven applications, and has instead gone directly to RL, which is a really hard problem. Is it to publish more papers? Or because of self-driving cars?*

(* optimal control tends to not work too well in highly uncertain, non-characterized, changing environments -- self-driving cars are an example of one such environment, where even the sensing problem is highly complicated, much less control)

3 comments

svalorzen 2680 days ago

RL is actually quite an umbrella term for a lot of things. There are policy gradient methods, which improve the policy directly to select better actions; value-based methods, which try to approximate the value function of the problem and derive a policy from it; and model-based methods, which try to learn a model and do some sort of planning/processing in order to get the policy.

Using model-based methods can allow you to do some pretty fancy stuff while massively reducing the number of data samples you need, but on the other hand there's a trade-off. Using the model tends to require lots of not-very-parallelizable computations, and can be more costly computationally. Very large problems can get out of hand pretty quickly, and there's still a lot of work to do before there is something which can be applied in general quickly and efficiently.

wenc 2680 days ago

Thanks for the insight on RL. That's good context for me.

I would say though that from my experience, computational cost is rarely the issue with model-based control, because there are various attacks ranging from model simplification (surrogate models, piecewise-affine multi-models, i.e. switching between many simpler local models, etc.) to precomputing the optimal control law [1] to embedding the model in silicon. Also, some optimal models/control laws can actually be parallelized fairly easily (MLD models are expressed as mixed-integer programs, which can be solved in performant ways using parallel algorithms, with some provisos). This is a well-trodden space with a tremendous amount of industry-driven research behind it.

Most of these methods come under the Model Predictive Control (MPC) umbrella which has been studied extensively over 3 decades [2]. The paradigm is extremely simple: (1) given a model of how output y responds to input u, predict over the next n time periods the values of u's needed to optimize an objective function. (2) Implement ONLY the first u. (3) Read the sensor value for y (actual y in real world). (4) Update your model with the difference between actual y and predicted y, move the prediction window forward, and repeat (feedback). When this is applied recursively, you obtain approximately optimal control on real-life systems even in the presence of model-reality mismatch, noise and bounded uncertainty.
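
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an industrial MPC: the scalar model `y[k+1] = a*y[k] + b*u[k]`, the deliberately mismatched "plant" parameters, the brute-force search over a discretized input grid, and the constant bias correction `d` used for step (4) are all hypothetical choices made to keep the example self-contained.

```python
from itertools import product

def mpc_step(y, d, setpoint, a, b, horizon, candidates):
    """Steps 1-2: search input sequences over the horizon, return only the first input."""
    best_cost, best_u0 = float("inf"), 0.0
    for seq in product(candidates, repeat=horizon):
        y_pred, cost = y, 0.0
        for u in seq:
            y_pred = a * y_pred + b * u + d      # model prediction plus bias correction
            cost += (y_pred - setpoint) ** 2     # tracking objective (assumed)
        if cost < best_cost:
            best_cost, best_u0 = cost, seq[0]
    return best_u0

def run_mpc(steps=30, setpoint=1.0):
    a_model, b_model = 0.9, 0.5                  # controller's model (hypothetical)
    a_plant, b_plant = 0.8, 0.5                  # "real" plant, mismatched on purpose
    candidates = [i / 10 for i in range(-10, 11)]  # input constraint: -1 <= u <= 1
    y, d = 0.0, 0.0
    history = []
    for _ in range(steps):
        u = mpc_step(y, d, setpoint, a_model, b_model, 3, candidates)
        y_expected = a_model * y + b_model * u   # what the uncorrected model expects
        y = a_plant * y + b_plant * u            # step 3: read the "sensor"
        d = y - y_expected                       # step 4: feed back the mismatch
        history.append((u, y))
    return history

history = run_mpc()
print(history[-1])  # last (input, output) pair
```

Even though the model's `a` is wrong, the feedback term `d` lets the loop settle near the setpoint, which is the model-reality mismatch tolerance described above. A real implementation would replace the grid search with a QP or NLP solver.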

If you think about it, this is the paradigm behind many planning strategies -- forecast, take a small action, get feedback, try again. The difference though is that MPC is a strategy with a substantial amount of mathematical theory (including stability analysis, reachability, controllability, etc.), software, and industrial practice behind it.

[1] Explicit MPC http://divf.eng.cam.ac.uk/cfes/pub/Main/Presentations/Morari...

[2] https://en.wikipedia.org/wiki/Model_predictive_control

breatheoften 2680 days ago

(disclaimer: I am not an RL researcher) I think grandparent was using 'model' to refer to model-based or 'value-based' reinforcement learning algorithms (as distinct from 'model-free' methods, e.g. 'policy-based' methods). I don't think they were directly referring to the same 'model' as is meant by MPC.

In RL, the goal is to try to find a function that produces actions that optimize the expected reward of some reward function. Model-based RL methods typically try to extract a function for 'representing' the environment and employ techniques to optimize action selection over that 'representation' (replace the word 'representation' with the word 'model'). Model-free RL methods instead try to directly learn to predict which actions to take without extracting a representation. A good paper describing deep Q-learning, a commonly cited model-free method that was one of the earliest to employ deep learning for a reinforcement learning task, is [1].
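
To make "model-free" concrete, here is a minimal sketch of tabular Q-learning (the table-based ancestor of the deep variant, not the network from the paper). The 5-state chain environment, the reward of 1 for reaching the last state, and the uniformly random behaviour policy are all hypothetical choices for illustration. Note that the learner only ever calls `step` to get samples; it never builds a transition function of its own.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (toy chain, assumed)
ACTIONS = (-1, +1)    # move left or move right

def step(state, action):
    """Environment: the learner sees samples from this, never its equations."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):                     # cap on episode length
            a = rng.randrange(2)                 # random exploration (off-policy)
            nxt, reward, done = step(state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            if done:
                break
            state = nxt
    return q

q = q_learning()
# Greedy action index per non-terminal state: 0 = left, 1 = right
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

The greedy policy read off the learned table moves right in every state, without the agent ever having been told how the chain works.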

I think it's worth clarifying -- RL algorithms as a whole are more akin to search than to control algorithms. RL algorithms can be used to solve some control problems -- but that is not all they are used for unless you take an extremely broad view about what constitutes a 'control problem' ... I don't think it would be common to model playing Go as a control problem, for example -- nor would I consider learning how to play all Atari games ever created, given only image frames and the current score and no other pre-supplied knowledge, to be a control problem ...?

(I'm talking way past my familiarity now) -- That said, optimal control theory intersects with RL quite a bit in the foundations -- Q-Learning techniques (a foundational family of methods in RL) have proofs that show under what conditions they will converge to the optimal policy -- I believe this mathematics to be quite similar to the mathematics used in optimal control theory...
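
For reference, the shared foundation can be written down compactly. The following is the standard textbook form for the tabular, discounted case (with bounded rewards and discount factor 0 <= \gamma < 1); the symbols s, a, r, s', \alpha_t and \gamma are notation introduced here, not taken from the thread.

```latex
% Bellman optimality equation (the same recursion used in dynamic programming)
Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \,\middle|\, s, a \right]

% Tabular Q-learning update from one sampled transition (s, a, r, s')
Q(s,a) \leftarrow Q(s,a) + \alpha_t \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

% Sufficient conditions for convergence to Q^* in the tabular case
\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty, \qquad
\text{every pair } (s,a) \text{ is visited infinitely often}
```

The first line is exactly the recursion that dynamic programming solves when the transition probabilities are known; Q-learning replaces the expectation with sampled transitions.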

[1] Deep Q-Networks https://storage.googleapis.com/deepmind-media/dqn/DQNNatureP...

wenc 2680 days ago

Thanks for sharing some really interesting thoughts. Just to add on to your comment...

The goal of optimal control is broadly similar to RL in that it aims to optimize some expected reward function by optimizing action selection for implementation in the environment.

The difference is that optimal control does not seek to learn either a representation or a policy in real-time -- it assumes both are known a priori.

Both can be thought of as containing hidden Markov models, though in optimal control the transition functions are assumed to be known whereas in RL they are unknown.
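
A small sketch of what "transition functions are assumed to be known" buys you: with the model in hand, the optimal values and policy can be computed offline by dynamic programming (value iteration here), with no trial and error in the environment. The 5-state chain, its reward, and the discount factor are hypothetical choices for illustration.

```python
N_STATES = 5          # states 0..4; state 4 is terminal (toy chain, assumed)
ACTIONS = (-1, +1)    # move left or move right
GAMMA = 0.9

def model(state, action):
    """Known transition and reward functions, given a priori."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def backup(v, state, action):
    """One-step lookahead through the known model."""
    nxt, reward = model(state, action)
    return reward + GAMMA * v[nxt]

def value_iteration(sweeps=50):
    v = [0.0] * N_STATES                      # terminal state value stays 0
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            v[s] = max(backup(v, s, a) for a in ACTIONS)
    return v

def greedy_policy(v):
    return [max(ACTIONS, key=lambda a: backup(v, s, a)) for s in range(N_STATES - 1)]

values = value_iteration()
policy = greedy_policy(values)
print(values, policy)
```

An RL method facing the same chain would have to estimate these quantities from sampled transitions instead of calling `model` directly.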

Another difference is that in control theory, we assume there is always a model -- though some models are implicit. You see, control algorithms either assume that the environment is explicitly characterized (model-based, like MPC), or that the controller contains an implicit model of the environment (internal model control principle, i.e. we adjust tuning parameters in PID control... there's no explicit model, but a correctly tuned controller behaves like a model-inverse/mirror of reality). In either case, the model, implicit or explicit, is arrived at beforehand -- once deployed, no learning or continual updating of the controller structure is done.
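
As a sketch of the "implicit model" case: the controller below contains only tuning gains and no equations of the plant. The gains, the first-order plant, and the unit time step are hypothetical values chosen so the loop is stable; the derivative gain is left at zero, so this runs as a PI controller.

```python
class PID:
    """Discrete PID controller with a unit time step; holds gains, not a plant model."""

    def __init__(self, kp, ki, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def simulate(steps=100, setpoint=1.0):
    controller = PID(kp=0.5, ki=0.2)   # gains tuned beforehand, fixed once deployed
    y = 0.0
    for _ in range(steps):
        u = controller.update(setpoint, y)
        y = 0.8 * y + 0.5 * u          # hypothetical plant, unknown to the controller
    return y

final_output = simulate()
print(final_output)
```

The gains encode knowledge of the plant only indirectly: change the plant enough and the same gains may oscillate or go unstable, which is why tuning happens before deployment.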

In contrast, RL has an exploration (i.e. learning) component that is missing from most control algorithms [1], and actively trades off exploration vs exploitation. In that sense, RL encompasses a larger class of problems than just control theory, whereas control theory is specialized towards the exploitation part of the exploration vs exploitation spectrum.
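
The exploration vs exploitation trade-off is easiest to see in a multi-armed bandit. The sketch below uses epsilon-greedy action selection, one simple way (among many) to balance the two; the three Bernoulli arms, their success probabilities, and the value of epsilon are hypothetical.

```python
import random

def epsilon_greedy_bandit(true_means, pulls=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(true_means)
    estimates = [0.0] * len(true_means)
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))       # explore: try a random arm
        else:
            arm = estimates.index(max(estimates))      # exploit: best estimate so far
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(counts, estimates)
```

With epsilon set to 0 the agent would lock onto whichever arm paid out first; a pure-exploitation controller behaves that way by design, because it assumes the payoffs are already known.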

[1] Though there are some learning controllers, such as ILC (iterative learning control) and adaptive controllers, which continually adapt to the environment. They have a weakness (perhaps RL suffers from the same) in that if a transient anomalous event comes through, they learn it and it messes up their subsequent behavior...

breatheoften 2680 days ago

I’m not sure how comparable adaptive control theory notions are to “reinforcement learning”. Adaptive obviously isn’t a perfectly defined word, but your usage makes me think you might be pondering applying RL to non-stationary environments, which I’m not sure RL currently handles well. Many reinforcement learning techniques _do_ require the environment to be approximately stationary (or at least perform much better when it is). Of course it can be stochastic, but the distributions should be mostly fixed, or else convergence challenges are likely to be exacerbated.

currymj 2680 days ago

If you have a good model, and can use model-based optimal control which has been understood for decades, then that is good but there's also not really a research problem? You can just do the simple, robust thing and it will work great. (i.e. "to publish more papers" is actually a legitimate reason if your job is explicitly to publish papers)

You may enjoy the article, "A Tour of Reinforcement Learning: The View from Continuous Control". At least that researcher would agree that people doing RL don't pay enough attention to "classical" control.

https://arxiv.org/abs/1806.09460

wenc 2680 days ago

> there's also not really a research problem?

This is also my suspicion. :) But to ignore optimal control altogether makes me suspect many AI researchers aren't familiar with the body of research, and many who've managed a cursory read of Wikipedia may believe that the state of the art in optimal control is LQR and LQG, when it's really MPC (which can be thought of as a generalization of LQR).
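
One way to see the "generalization" claim is to write the two problems side by side. This is the standard discrete-time textbook form; the symbols A, B, Q, R, P, K, N, \ell, V_f, \mathcal{U} and \mathcal{X} are notation introduced here, not taken from the thread.

```latex
% Infinite-horizon LQR: linear model, quadratic cost, no constraints
\min_{u_0, u_1, \dots} \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + u_k^\top R u_k \right)
\quad \text{s.t.} \quad x_{k+1} = A x_k + B u_k

% Its solution is a fixed linear feedback law
u_k = -K x_k, \qquad K = (R + B^\top P B)^{-1} B^\top P A

% where P solves the discrete algebraic Riccati equation
P = Q + A^\top P A - A^\top P B \, (R + B^\top P B)^{-1} B^\top P A

% MPC: finite horizon N, re-solved at every time step, with explicit constraints
\min_{u_0, \dots, u_{N-1}} \sum_{k=0}^{N-1} \ell(x_k, u_k) + V_f(x_N)
\quad \text{s.t.} \quad x_{k+1} = f(x_k, u_k), \quad u_k \in \mathcal{U}, \quad x_k \in \mathcal{X}
```

With a linear f, a quadratic \ell, no constraints, and terminal cost V_f(x) = x^\top P x, the first MPC input coincides with the LQR law u_0 = -K x_0; MPC extends this by allowing constraints, nonlinear models, and other cost functions.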

Also, MPC is a model-type and optimization-algorithm agnostic paradigm, so there are plenty of ways to combine models/algorithms within its broad framework -- this is partly how many MPC researchers come up with new papers :). I think AI researchers should take a look at it as a complement to RL for the problems they're trying to solve.

Thanks for the link to the paper -- I will take a look.

computerphage 2680 days ago

Do you have an example of a self driving car company that uses RL?

wenc 2680 days ago

Nope.
