|
|
|
|
|
by sparsevector
5127 days ago
|
|
> I believe that I can produce a multi-armed bandit algorithm that has logarithmic regret and will work in the face of ever varying conversion rates, as long as there is a consistent relative difference between the versions. That sounds interesting. As you undoubtedly know, there is a lot of literature on multi armed bandit problems and also on the non-stochastic multi armed bandit problem. The latter has the advantage of not assuming anything about the way in which the sequence of rewards is generated. There is a a line of work in the non stochastic setting which sounds related to what you are describing. It's called the problem of tracking the best expert: http://www.cse.ucsc.edu/~mark/papers/track-long.ps
The idea in this problem is to do well as compared not to the best fixed action but rather the best piecewise constant sequence of actions. There are many variations of this problem, but the basic idea is that you can have low regret as compared to a changing sequence of actions so long as that sequence doesn't change too "quickly" for some definition of quickly. |
|
Here is the specific problem that I am interested in.
Suppose we have a multi-armed bandit where the probability of reward are different on every pull. The distribution of rewards are not known. It is known that there is one arm which, on every pull, has a minimum percentage chance by which it is more likely to produce a reward if pulled than any other arm. Which arm this is is unknown, as is the minimum percentage.
In other words payoff percentages change quickly, but the winner is always a winner, and your job is to find it. I have an algorithm that succeeds with only logarithmic regret. I have no idea whether anyone else has come up with the same idea, and there is so much literature (much of which I presume is locked up behind paywalls) that I cannot easily verify whether anyone else has looked at this variant.