|
|
|
|
|
by sparsevector
5127 days ago
|
|
I think what you're describing is basically the non-stochastic setting. The idea there is to assume absolutely nothing about the rewards (they can be generated by quickly changing distributions or even an adversary) but still try to do as well as the best single arm. The standard algorithm for this problem is the multiplicative weights update method, also called the exponentially weighted average method or the hedge algorithm. It's a really simple method where you keep a probability distribution over the arms and then reweight the probabilities at each time step using something proportional to the exponential of the rewards of the arms. If you run it for T time steps on a problem involving N arms, it gives you regret O(sqrt(T log N)). This is actually the best possible bound without additional assumptions. This is a good paper on it: http://cns.bu.edu/~gsc/CN710/FreundSc95.pdf There are many, many different variations of the method and result. For example, you can show improved results assuming different things about the rewards. For exapmle, you can show O(log(T)) regret with certain additional assumptions. This is a really good book on different non-stochastic online learning problems:
http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Predi... |
|
The book looks like it would take a long time to work through, and they don't get to the multi-armed bandit problem until chapter 6. It goes into my todo list, but is likely to take a while to get off of it...