RL is a good theoretical solution for personalization: given a user state, select an action that maximizes a long-term reward (e.g. revenue or engagement). Building implementations is tricky because, unlike Go, chess, or Atari, it's hard to simulate humans. So you have to train the agents offline on batches of data (i.e. historical data from the agent's past actions). This is challenging because you don't get as many chances to try different hyperparameters. It's starting to be used more in industry, though.
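To make the offline setup concrete, here is a minimal sketch of scoring a candidate policy from logged data alone, using inverse propensity scoring (IPS). IPS is my choice of technique here, not something named above, and it only handles the one-step (bandit) case rather than long-term reward. The states, actions, click rates, and the uniform logging policy are all made up for illustration.

```python
import random


def ips_value(logs, target_policy):
    """Inverse propensity scoring estimate of a policy's average reward.

    logs: list of (state, action, reward, propensity) tuples, where
    propensity is the probability the logging policy gave the logged action.
    target_policy: deterministic function mapping state -> action.
    """
    total = 0.0
    for state, action, reward, propensity in logs:
        # Only logged actions the target policy agrees with contribute,
        # reweighted by how unlikely the logging policy was to pick them.
        if target_policy(state) == action:
            total += reward / propensity
    return total / len(logs)


def make_logs(n, seed=0):
    """Synthetic logs from a hypothetical uniform-random logging policy."""
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        state = rng.choice(["new_user", "returning_user"])
        action = rng.choice(["promo", "no_promo"])  # each chosen with p = 0.5
        # Assumed ground truth: promos help new users, hurt returning ones.
        matched = (state == "new_user") == (action == "promo")
        p_click = 0.6 if matched else 0.2
        reward = 1.0 if rng.random() < p_click else 0.0
        logs.append((state, action, reward, 0.5))
    return logs


def targeted(state):
    return "promo" if state == "new_user" else "no_promo"


def always_promo(state):
    return "promo"


logs = make_logs(20000)
targeted_value = ips_value(logs, targeted)
always_promo_value = ips_value(logs, always_promo)
```

Under these assumed click rates the targeted policy should come out near 0.6 and always-promo near 0.4, without ever deploying either one. The catch is the same as in the comment above: you only get estimates for actions the logging policy actually tried, and the variance grows quickly when logged propensities are small.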
I’ve not kept up with the recent developments in this field - is Vowpal Wabbit widely used now? Any competitors? Or do people build their own in-house systems?
When the first PC with BASIC launched in the 80s, many people wanted to develop for it.
When the iPhone App Store launched, many people started to build apps in the ecosystem.
While it might be a bit too early to compare RL to those advances in technology, I personally feel there is huge potential. I might be wrong though. And I am fine with that.
RL needs a supercomputer, and its code is usually too fragile: a trivial mistake anywhere (a missing constant multiplier, two consecutive lines of code in the wrong order, etc.) will likely keep your model from ever converging, even if you got everything else right.
The hard part of RL for the problems I've encountered in my work is that you need a simulator. Building a reliable and accurate simulator is often an immense undertaking.
In my opinion, there's a wide-open array of approaches from control theory that can help with this. Learning for Control is a new conference that looks at this very topic.