I have mixed feelings about using multi-armed bandits for product testing like this. Regret minimization makes complete sense as a framework if you are testing a large inventory of things (the classic examples being ads or recommendations), since there can be a real opportunity cost in not showing some of the items in inventory, particularly if the inventory has a shelf life. (I'm also quite surprised they don't use Thompson sampling...)

For testing product features, though, I feel like there is often a high long-term cost to the dev team, while the regret from showing users a non-optimal treatment during the experiment is pretty minimal (to first order, the regret is usually just the cost of experimental bandwidth). The team cost comes in several subtle forms:

- in practice, bandits encourage lots of small experiments, which leave behind a large graveyard of dead experiment code; you can mitigate this by having strict stopping points for bandit experiments
- bandits have higher statistical power, but also a higher false-positive rate; false positives can be very costly, since they cause thrash and take time to investigate when a feature that tested well does poorly in production
- you introduce novelty effects over time as new sample groups get added by the dynamic allocation; probably not a big deal for most experiments, but it's complicated to correct for if your experiment does have novelty effects
- there are often cyclical, time-dependent changes in the composition of users being exposed (daytime vs. nighttime, weekday vs. weekend, geography because of time zone differences); also probably not a big deal for most experiments, but it requires complex stratification to correct for if it is an issue

I would also say that the majority of product changes have small but measurable effects on metrics, so I'm not sure bandits help all that much in those cases.
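For readers unfamiliar with the dynamic allocation being discussed, here is a minimal sketch of Beta-Bernoulli Thompson sampling. It assumes a binary conversion metric, uniform Beta(1, 1) priors, and simulated arms with made-up true rates; the function name `thompson_sampling` is hypothetical and this is not the article's implementation.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling.

    true_rates: hypothetical per-arm conversion probabilities used only to
    simulate outcomes. Returns how many times each arm was shown.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its Beta(1 + s, 1 + f) posterior.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)]
        # Show the arm whose sampled rate is highest.
        arm = max(range(k), key=lambda i: samples[i])
        pulls[arm] += 1
        # Observe a simulated Bernoulli outcome and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


print(thompson_sampling([0.10, 0.11], n_rounds=10_000))
```

Because each user's assignment depends on all earlier outcomes, the mix of users in each arm shifts over time, which is where the novelty and time-of-day concerns above come from. When the arms differ only slightly (as with most product changes), the posteriors overlap heavily and allocation tends to stay close to even for a long time.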
If there are runaway successes or failures, early stopping techniques seem like a better way to free up resources, and early stopping policies can be tuned fairly simply to address the experiment design problems above. Again, this is all about product testing. For recommendations and personalization, I think contextual bandits make a lot of sense.
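As a rough illustration of the early stopping alternative, here is a minimal sketch of a fixed 50/50 A/B test with a few planned interim looks, assuming a binary metric and a pooled two-proportion z-test. The stopping threshold uses a Bonferroni split of alpha across looks, a simple and conservative stand-in for proper group-sequential boundaries such as Pocock or O'Brien-Fleming; the function names and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z(s_a, n_a, s_b, n_b):
    """Pooled two-proportion z statistic for the difference B - A."""
    p_pool = (s_a + s_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:  # no variance yet (all successes or all failures)
        return 0.0
    return (s_b / n_b - s_a / n_a) / se


def run_with_early_stopping(rate_a, rate_b, look_every, max_looks, alpha=0.05, seed=0):
    """Simulate a fixed-allocation A/B test with interim looks.

    Returns (look_number, z, stopped_early). Bonferroni keeps the overall
    false-positive rate at or below alpha despite the repeated looks.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * max_looks))  # two-sided
    s_a = s_b = n = 0
    z = 0.0
    for look in range(1, max_looks + 1):
        # Fixed, equal allocation: each iteration adds one user to each arm.
        for _ in range(look_every):
            s_a += rng.random() < rate_a
            s_b += rng.random() < rate_b
        n += look_every
        z = two_proportion_z(s_a, n, s_b, n)
        if abs(z) > z_crit:
            return look, z, True  # runaway winner or loser: stop early
    return max_looks, z, False  # ran to the planned horizon


print(run_with_early_stopping(0.10, 0.11, look_every=5_000, max_looks=4))
```

Unlike a bandit, the allocation never changes, so every interim comparison is between user groups drawn from the same time windows, and the number of looks is fixed up front.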
Half of the article talks about how they use Thompson sampling.