HN Mirror


by cschmidt 731 days ago

It would probably be good to have something that accounts for multiple comparisons (False Discovery Rate, Bonferroni correction), which are often the bane of running a whole series of A/B tests. And, as another poster has mentioned, an anytime-valid approach that stays valid when people stop early after peeking at the results [1].

For those who haven't read about Fisher's tea experiment: there was a woman who claimed she could tell whether the milk was put into the cup before or after the tea was poured. Fisher didn't think so, and developed the experimental technique to test this idea. Indeed she could, getting them all right iirc.
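As a rough illustration of why a perfect score is convincing, here is a minimal sketch. It assumes the design usually described for this experiment (8 cups, 4 milk-first and 4 tea-first, with the taster told the split); the cup counts are not stated in this thread.

```python
from math import comb

# Assumed design: 8 cups, 4 of each kind, and the taster must pick the 4 milk-first cups.
cups, milk_first = 8, 4

# Number of equally likely ways to choose 4 cups out of 8 by pure guessing.
ways = comb(cups, milk_first)

# Exactly one of those choices gets every cup right.
p_all_correct = 1 / ways

print(ways, round(p_all_correct, 4))  # 70 0.0143
```

Under that assumption, guessing gets everything right only about 1.4% of the time, which is the one-sided p-value of Fisher's exact test for a perfect result.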

[1] See https://media.trustradius.com/product-downloadables/UP/GB/AD... for a discussion of the problems with a t-test. There is also a more detailed whitepaper from Optimizely somewhere.

3 comments

gatopingado 731 days ago

For anyone interested in anytime-valid testing, I wrote a Python library [1] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [2].

[1] https://github.com/assuncaolfi/savvi/

[2] https://openreview.net/forum?id=a4zg0jiuVi


e10v_me 731 days ago

I thought about multiple comparison corrections. Here is what I was thinking:

1. Experiments with 3 or more variants are quite rare in my practice. I usually try to avoid them.

2. In my opinion, the Bonferroni correction is just wrong. It's too pessimistic. There are better methods though.

3. The choice of alpha is subjective. Why use a precise smart method to adjust a subjective parameter? Just choose another subjective alpha, a smaller one :)

But I can change my opinion if I see a good argument.


cschmidt 731 days ago

If you work for a large website (as I used to), they probably run hundreds of tests a week across various groups. So false positives are a real problem, and often you don't see the gain suggested by the A/B test once you roll the change out.

I agree that Bonferroni is often too pessimistic. If you apply a Bonferroni correction, you'll usually find that nothing is significant. And I take your point that you could just adjust $\alpha$. But then, of course, you can make things significant or not as you like through that choice.

False Discovery Rate control is less conservative, and I have used it successfully in the past.
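To make the Bonferroni vs. FDR contrast concrete, here is a minimal sketch in plain Python. The p-values are made up for illustration, and the FDR procedure shown is Benjamini-Hochberg, which is one common choice and not necessarily the one used above.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR control: find the largest rank k with p_(k) <= k/m * alpha
    and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


# Hypothetical p-values from ten A/B tests (illustrative only).
pvals = [0.001, 0.008, 0.012, 0.019, 0.030, 0.200, 0.450, 0.600, 0.750, 0.900]

print("Bonferroni rejections:", sum(bonferroni(pvals)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvals)))
```

On these numbers Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the four smallest, at the cost of tolerating a controlled share of false discoveries among them.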

People have strong incentives to find significant results that can be rolled out, so you don't want that person choosing $\alpha$. They will also be peeking at the results every day of a weekly test, and wanting to roll it out if it bumps into significance. I just mention this because the most useful A/B libraries are ones that are resistant to human nature. PMs will talk about things being "almost significant" at 0.2 everywhere I've worked.
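A quick way to see the daily peeking problem is to simulate A/A tests, where there is no real difference, and count how often a naive test crosses significance on any day versus only on the last day. This is a sketch under assumed numbers (7 days, 100 users per arm per day, 20% conversion rate, two-sided z-test at 1.96); none of them come from a real experiment.

```python
import math
import random


def simulate(n_sims=1000, days=7, users_per_day=100, rate=0.2, z_crit=1.96, seed=0):
    """Return (false positive rate with daily peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        crossed = False      # significant on at least one day so far
        significant = False  # significant on the most recent day
        for _day in range(days):
            # Both arms share the same true rate, so any "win" is a false positive.
            conv_a += sum(rng.random() < rate for _u in range(users_per_day))
            conv_b += sum(rng.random() < rate for _u in range(users_per_day))
            n += users_per_day
            pooled = (conv_a + conv_b) / (2 * n)
            se = math.sqrt(pooled * (1 - pooled) * 2 / n)
            significant = se > 0 and abs(conv_a / n - conv_b / n) / se > z_crit
            crossed = crossed or significant
        peeking_hits += crossed
        final_hits += significant
    return peeking_hits / n_sims, final_hits / n_sims


peek_rate, final_rate = simulate()
print(f"stop at first significant day: {peek_rate:.3f}")
print(f"test once at the end:          {final_rate:.3f}")
```

Testing once at the end should land near the nominal 5%, while stopping on the first significant day gives a noticeably higher false positive rate. Anytime-valid methods are designed to keep the error rate controlled under exactly this behaviour.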


e10v_me 730 days ago

Thank you for the explanation and for drawing a vivid picture :) I will add FWER and FDR to the roadmap. Which specific controlling procedures do you find the most useful in practice?

I'm considering the following:

- FWER: Holm–Bonferroni, Hochberg's step-up.

- FDR: Benjamini–Hochberg, Benjamini–Yekutieli.
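For reference, the two FWER procedures compare the sorted p-values against the same thresholds, alpha / (m - k + 1), and differ only in direction. The sketch below is a plain-Python illustration with made-up p-values, not the API of any particular library.

```python
def holm(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: walk up from the smallest p-value and
    stop at the first one that misses its threshold alpha / (m - k + 1)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if pvals[i] > alpha / (m - rank + 1):
            break
        reject[i] = True
    return reject


def hochberg(pvals, alpha=0.05):
    """Hochberg step-up: find the largest rank k whose p-value meets
    alpha / (m - k + 1) and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - rank + 1):
            k_max = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject


# Hypothetical p-values for a three-variant comparison (illustrative only).
pvals = [0.02, 0.03, 0.04]
print("Holm:    ", holm(pvals))      # smallest p misses 0.05/3, so nothing is rejected
print("Hochberg:", hochberg(pvals))  # largest p meets 0.05/1, so all are rejected
```

Hochberg is never less powerful than Holm, but it relies on the tests being independent or positively dependent, whereas Holm holds under arbitrary dependence. The FDR pair has a similar trade-off: Benjamini–Yekutieli is the more conservative variant that holds under arbitrary dependence.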


cschmidt 730 days ago

Personally, I've used FDR, but FWER is meant to be good as well. I guess I don't have a preference.


welder 731 days ago

And there's Student's t-test, so named because William Sealy Gosset's employer (the Guinness brewery) allowed him to publish it anonymously, so he published under the pseudonym "Student".
