

by meigetsu 2110 days ago

I have mixed feelings about using multi-armed bandit for product testing like this. Regret minimization makes sense 100% as a framework if you are testing a large inventory of things - i.e. the classic examples of showing ads or recommendations - since there might be some real opportunity cost in not showing some of the things in inventory (particularly if the inventory has a shelf life). (I'm also quite surprised they don't use thompson sampling...)
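For readers who have not seen it, here is a minimal sketch of Beta-Bernoulli Thompson sampling, the regret-minimizing allocation discussed above. It is not taken from the article; the function name `thompson_sampling`, the uniform Beta(1, 1) prior, and the conversion rates are assumptions chosen for illustration.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Allocate n_rounds of traffic across arms with Beta-Bernoulli Thompson sampling.

    true_rates are the hidden conversion rates (hypothetical values, used only
    to simulate user responses). Returns how many times each arm was shown.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible rate per arm from its Beta(1 + s, 1 + f) posterior.
        samples = [
            rng.betavariate(1 + successes[i], 1 + failures[i]) for i in range(k)
        ]
        # Show the arm whose sampled rate is highest.
        arm = max(range(k), key=lambda i: samples[i])
        pulls[arm] += 1
        # Simulate the user's response and update that arm's posterior counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


pulls = thompson_sampling([0.04, 0.05, 0.15], n_rounds=5000)
print(pulls)
```

Traffic drifts toward the best arm as evidence accumulates, which is where the regret savings come from. It is also why the sample groups change over time.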

For testing product features, though, I feel like there is often a high long-term cost to the dev team, while the regret from showing users a non-optimal treatment during the experiment is pretty minimal (to first order, the regret is usually only the cost of experimental bandwidth).

The team cost comes in several subtle forms:

- in practice, bandits encourage lots of small experiments which leave behind a large surface area graveyard of code - you can mitigate this by having strict stopping points for bandit experiments

- bandits have higher statistical power, but also higher false-positive rate; false positives can be quite high cost since they cause thrash and require time to investigate if a feature that tested well does poorly in production

- you are introducing novelty effects over time as new sample groups get added in the dynamic allocation; probably nbd for most experiments, but it's complicated to correct for this if your experiment has novelty effects

- there are often cyclical time-dependent changes in the composition of users being exposed (daytime vs night time, week day vs weekend, geography bc of timezone differences); also, probably nbd for most experiments, but requires complex stratification to correct for if this is an issue

I would also say that the majority of product changes have small, but measurable effects on metrics, so I'm not sure that bandits help all that much in those cases. If there are runaway successes or failures, early stopping techniques seem like a better way to free up resources - early stopping policies can be tuned to address the experiment design problems above fairly simply.
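As a rough illustration of what an early stopping policy can look like, here is a sketch of a two-arm test checked at a fixed number of planned looks, with the significance level split evenly across looks (a Bonferroni-style correction). This is one conservative option among many, not a specific method named in the thread; `should_stop`, `planned_looks`, and the example counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_two_proportions(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic for arm B versus arm A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_b - p_a) / se


def should_stop(conv_a, n_a, conv_b, n_b, planned_looks, alpha=0.05):
    """Return True if the experiment can stop at this interim look.

    Splitting alpha evenly across the planned looks keeps the overall
    false-positive rate at or below alpha, at some cost in power.
    """
    per_look_alpha = alpha / planned_looks
    threshold = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    return abs(z_two_proportions(conv_a, n_a, conv_b, n_b)) >= threshold


# A runaway success stops early; a marginal difference keeps running.
print(should_stop(100, 1000, 200, 1000, planned_looks=5))
print(should_stop(100, 1000, 105, 1000, planned_looks=5))
```

The number of looks and the per-look threshold are the knobs you would tune. Scheduling looks at whole-week boundaries, for example, is one way to sidestep the weekday/weekend composition problem.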

Again, this is all for product testing. I think for recommendations and personalization, contextual bandits make lots of sense.

2 comments

trumpeta 2110 days ago

> I'm also quite surprised they don't use thompson sampling

Half of the article talks about how they use Thompson Sampling


meigetsu 2110 days ago

huh, weird - I saw this post in Aug and can't understand how I missed that. Thanks for pointing that out - it does indeed discuss it.


stdbrouw 2110 days ago

> bandits have higher statistical power, but also higher false-positive rate; false positives can be quite high cost since they cause thrash and require time to investigate if a feature that tested well does poorly in production

Not sure what you mean by this. Higher false-positive rate compared to what? And given that bandits do not run for a predefined amount of time but converge at a rate proportional to the evidence (as opposed to your typical AB-test), a higher rate at which point in time?

Perhaps you mean that, because bandits typically run longer, there's a higher chance that they'll select an alternative that offers only a marginal improvement on the status quo whereas short experiments would just say "nah, no evidence that one is better than the other" and thereby get rid of a lot of noise?


meigetsu 2110 days ago

Thank you - your comment is right and I conflated two things which are conceptually totally different.

For a given number of experiments and block of time (i.e. available samples over time), it's not useful to say that bandits have higher power / a worse FPR, bc the values are adjustable. F1 or AUC would probably be the right way to compare, and it seems likely to me that bandits have better-performing precision-recall curves. Basically, this is irrelevant to my original point and, if anything, favors bandits.

I was totally thinking about the scenario you mentioned, where the number of experiments is unconstrained and old experiments run long. Bandits will spend a lot of their bandwidth on very marginal improvements that are below the effect size cutoff that a shorter fixed RCT would set. I think you can fix this with early stopping (or just stopping), so maybe it's not really an issue after all.
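To make the "effect size cutoff" concrete, here is a sketch of the usual normal-approximation formula for the minimum detectable effect (MDE) of a fixed two-arm test on a conversion rate. The function name, the 10% baseline, and the defaults of alpha = 0.05 and 80% power are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.8):
    """Smallest absolute lift a fixed two-arm test is likely to detect.

    Uses the normal approximation and assumes both arms have roughly the
    baseline variance, which is reasonable for small lifts.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    se = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * se


mde_short = minimum_detectable_effect(0.10, n_per_arm=10_000)
mde_long = minimum_detectable_effect(0.10, n_per_arm=40_000)
print(round(mde_short, 4), round(mde_long, 4))
```

Quadrupling the sample only halves the MDE, so a short fixed test effectively ignores small lifts. A bandit left running keeps collecting samples and will eventually chase them.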

Thanks for helping clarify my thinking on this :)
