Hacker News new | ask | show | jobs
by verdatel 5411 days ago
The theory of multi-armed bandits arises from the study of Markov decision processes (MDP). This field is also closely studied in and related to Stochastic Control and Dynamic Programming. A good book on MDP's is "Markov Decision Processes: Discrete Stochastic Dynamic Programming by Martin Puterman". It's fairly mathematical and is for grad-level learning. An alternative is another superb book "Dynamic Programming and Optimal Control" by Dimitri Bersekas.

A more accessible book is "Approximate Dynamic Programming" by Warren Powell.

I'm not sure if you know what a Markov chain is? Very roughly, any kind of data which changes its value (sometimes referred to as "state") and jumps to a new value is a Markov chain. The value that the chain jumps to is determined probabilistically using some distribution over all the states. When there are several Markov chains in parallel and each of them contribute to a certain reward and we are faced with the problem of which Markov chain to "control" or influence.. then it becomes a multi-armed bandit problem. They are not very easy to solve, except when certain simplifying assumptions are made. You can get a good start by looking at some MATLAB code from Prof. Kevin Murphy at UBC : http://www.cs.ubc.ca/~murphyk/Software/MDP/mdp.html