Hacker News
by nafizh 2853 days ago
With all the excitement surrounding RL, I have yet to see substantial practical applications of RL in real life apart from games and a few articles I have read on how companies use RL for recommendations or ad suggestions.

So, as an outsider, it is kind of puzzling what generates this excitement.

7 comments

Reinforcement learning is a solid fit for a number of traditional operations research problems. (In fact, that's pretty much what motivated research into RL in the first place, as I understand it.)

One concrete example I heard about was using reinforcement learning to price airline tickets. Behind the scenes, airlines break up the tickets on a single plane into a large number of distinctly priced types. The question of how much of each type of ticket to offer at what price and how to change this over time (as the actual flight is coming closer and closer) is a massive optimization problem that's too large to solve exactly. Reinforcement learning coupled with simulation can find good solutions if you set up the feature space correctly. (In this case, I remember that the only feature that ultimately mattered was either total profit or total revenue for the mix of tickets being offered.)

One thing to note here is that this is using "normal" reinforcement learning, not "deep" reinforcement learning. You can get away with having a simple functional approximation of the state instead of reaching for a neural network. This seems true for most operations research problems where you would reach for reinforcement learning—figuring out a way to model the state by hand works well enough and has the important benefit of being easier to understand and interpret. The "deep" part becomes useful when your state space is so large and complex that other techniques become infeasible.

One of the core equations in reinforcement learning is the Bellman equation -- named after Richard Bellman, inventor of dynamic programming. And in fact, in the operations research community, reinforcement learning is often referred to as "approximate dynamic programming". There are lots of extremely boring, quite effective techniques for solving real industrial problems in this framework, without any neural networks at all.
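To make the Bellman equation concrete, here is a minimal sketch of exact dynamic programming on a toy version of the ticket-pricing problem from the comment above. Everything in it is an assumption made up for illustration: the three fare levels, the sale probabilities, the one-sale-per-period demand model, and the function name `solve`. Real airline problems are far too large to enumerate like this, which is where the approximate techniques come in.

```python
# Toy finite-horizon pricing problem solved by backward induction
# (exact dynamic programming). All numbers are illustrative assumptions.

PRICES = [100, 200, 300]                     # hypothetical fare levels
SALE_PROB = {100: 0.9, 200: 0.5, 300: 0.2}   # assumed chance of one sale per period


def solve(seats, periods):
    """Return (V, policy).

    V[t][s] is the expected revenue-to-go with s seats left at period t.
    policy[t][s] is the price that achieves it (None when s == 0).
    """
    V = [[0.0] * (seats + 1) for _ in range(periods + 1)]
    policy = [[None] * (seats + 1) for _ in range(periods)]
    for t in range(periods - 1, -1, -1):      # walk backward from departure
        for s in range(1, seats + 1):
            best_value, best_price = float("-inf"), None
            for p in PRICES:
                q = SALE_PROB[p]
                # Bellman backup: expected immediate revenue plus the
                # value of whichever state we land in next.
                value = q * (p + V[t + 1][s - 1]) + (1 - q) * V[t + 1][s]
                if value > best_value:
                    best_value, best_price = value, p
            V[t][s] = best_value
            policy[t][s] = best_price
    return V, policy


V, policy = solve(seats=2, periods=3)
```

In the last period with one seat left, this sketch picks the 200 fare, since 0.5 × 200 = 100 beats 0.9 × 100 = 90 and 0.2 × 300 = 60. Approximate dynamic programming replaces the exact table `V` with an estimate when the state space is too big to enumerate.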

As for why there is so much excitement about "deep RL" when it hasn't done anything substantive outside of games -- I think it's because it has the possibility of working in wildly different domains with minimal modification. We can sort of see some of this already -- OpenAI used the same training algorithm to play Dota 2 as they did to train a robotic hand to manipulate blocks.

I have no idea whether the hype will actually pan out, but that generality is worth being cautiously excited about, I think.

Researchers have used reinforcement learning techniques to train neural models with non-differentiable loss functions.

For instance, a common practice is to:
1. train a translation model with standard cross-entropy loss
2. fine-tune the model with reinforcement learning against BLEU scores.

BLEU is generally the metric that's reported in translation papers, but it can't be used directly as an objective function for gradient-based training since it isn't differentiable.

Reinforcement learning can be used to generate gradients from these types of objectives.
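As a rough illustration of how that works, here is a minimal sketch of the score-function (REINFORCE) gradient estimator on a toy problem. It is not a translation model: the "policy" is just a softmax over four discrete outputs, and the hypothetical `reward` function is a step function standing in for a non-differentiable metric like BLEU. The learning rate, baseline update, and step count are arbitrary assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(action):
    # Stand-in for a non-differentiable metric such as BLEU: a step
    # function of the sampled output (hypothetical, for illustration).
    return 1.0 if action == 2 else 0.0


def train(num_actions=4, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_actions
    baseline = 0.0                            # running average reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(num_actions), weights=probs)[0]
        r = reward(action)
        advantage = r - baseline
        # For a softmax policy, the gradient of log pi(action) with respect
        # to the logits is onehot(action) - probs. The reward only scales
        # this gradient, so it never needs to be differentiated itself.
        for i in range(num_actions):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
        baseline += 0.05 * (r - baseline)
    return softmax(logits)


final_probs = train()
```

After training, `final_probs` should put most of its mass on output 2, the only one the toy metric rewards. In the fine-tuning setup described above, the sampled "action" would be a whole translated sentence and the reward its BLEU score.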

Game playing, recommendations, and ad suggestions are all examples of problems with non-differentiable objectives, but there are many more problems of this type that I think we haven't explored as deeply and that could potentially be solved with RL.

Here's a two-sentence summary of SOTA:

- Model-free methods have seen great success in learning high-dimensional tasks, but they are sample inefficient. In other words, they take too long to train on real robots. Examples of these methods are TRPO, PPO, ES, etc.

- Model-based methods are an order of magnitude more sample efficient and thus more practical on real-world robots. However, these methods have high bias, and most working models are simple in terms of representational power, e.g. GP, time-varying linear, mixture of Gaussians. Examples are PILCO, GPS, PETS, etc.

Of course, SOTA is a lot more complicated, but this is a short explanation for your observation.

RL is equivalent to optimal control in many real-world systems like robotics, self-driving vehicles, and other complex systems (aircraft control, etc.). There are lots of practical applications for RL, but it doesn't always work well; solving even simple problems can often be deceptively complex, to the point of being intractable, unstable, or both.

To my knowledge, not a single self-driving car company is using RL in production. It is more of a theoretical methodology than anything else.

Serious A/B testing is usually done with a bandit algorithm these days; you can converge with far fewer trials than a statistically meaningful A/B test needs. News and other article recommendations are often bandits too, MSN being a prime example.
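
For anyone who hasn't seen a bandit used in place of an A/B test, here is a minimal sketch using Thompson sampling with Beta priors. This is one common bandit algorithm, not necessarily what any particular company runs. The conversion rates, the trial count, and the function name `thompson_ab_test` are all made up for illustration.

```python
import random


def thompson_ab_test(true_rates, trials=5000, seed=0):
    """Simulate a Bernoulli bandit over page variants.

    true_rates are the (normally unknown) conversion rates, used here
    only to simulate visitors. Returns how often each variant was shown.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    wins = [1] * n        # Beta(1, 1) prior, i.e. uniform over the rate
    losses = [1] * n
    pulls = [0] * n
    for _ in range(trials):
        # Draw a plausible conversion rate for each variant from its
        # posterior and show the variant with the highest draw.
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(n)]
        arm = max(range(n), key=lambda i: samples[i])
        converted = rng.random() < true_rates[arm]
        if converted:
            wins[arm] += 1
        else:
            losses[arm] += 1
        pulls[arm] += 1
    return pulls


pulls = thompson_ab_test([0.05, 0.15])
```

Unlike a fixed 50/50 split, the bandit shifts traffic toward the better variant while it is still learning, so `pulls` ends up heavily skewed toward the variant with the higher simulated rate.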
Spoken dialog systems.

Check out Steve Young's group at Cambridge; it's considered the state of the art (well, at least by some ;-)).