by gwerbret 407 days ago
> Stopping an experiment once you find a significant effect but before you reach your predetermined sample size is classic P hacking.

Although much of the article is basic common sense, and although I'm not a statistician, I had to seriously question the author's understanding of statistics at this point. The predetermined sample size (statistical power) is usually based on an assumption made about the effect size; if the effect size turns out to be much larger than you assumed, then a smaller sample size can be statistically sound.
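
To make the dependence on the assumed effect size concrete, here is a minimal sketch of the textbook normal-approximation formula for a two-sided, two-sample comparison of means. The function name and the defaults (alpha of 0.05, 80% power) are assumptions for illustration; an exact t-test calculation gives slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardized effect size (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Hypothetical effect sizes: small, medium, large
for d in (0.2, 0.5, 1.0):
    print(d, n_per_group(d))
```

This only covers planning a fixed sample size up front; it does not account for looking at the data partway through.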

Clinical trials very frequently do exactly this -- stop before they reach a predetermined sample size -- by design, once certain pre-defined thresholds have been passed. Other than not having to spend extra time and effort, the reasons are at least twofold: first, significant early evidence of futility means you no longer have to waste patients' time; second, early evidence of utility means you can move an effective treatment into practice that much sooner.

A classic example of this was with clinical trials evaluating the effect of circumcision on susceptibility to HIV infection; two separate trials were stopped early when interim analyses showed massive benefits of circumcision [0, 1].

In experimental studies, early evidence of efficacy doesn't mean you stop there, report your results, and go home; the typical approach, if the experiment is adequately powered, is to repeat it (three independent replicates is the informal gold standard).

[0]: https://pubmed.ncbi.nlm.nih.gov/17321310/

[1]: https://pubmed.ncbi.nlm.nih.gov/16231970/

8 comments

https://commons.m.wikimedia.org/wiki/File:P-hacking_by_early...

The author is absolutely correct. Early stopping is a classic form of p hacking. See attached image for an illustration.

If you want to be rigorous, you can define a criterion for early stopping such that it isn't p-hacking, but it requires relatively stronger evidence.

Clinical trials that stop early do so typically at predefined times with higher significance thresholds.
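
To put rough numbers on the "higher significance thresholds" point, here is a minimal simulation sketch. It assumes a one-sample z-test with known variance, five equally spaced looks, and no true effect; the function name and all parameters are made up for illustration. The 2.413 cutoff is the Pocock constant usually tabulated for five looks at an overall two-sided 0.05 level; real trials may use other boundaries (e.g. O'Brien-Fleming).

```python
import math
import random

def false_positive_rate(z_threshold, n_looks=5, batch=20, n_sims=4000, seed=1):
    """Fraction of null experiments that cross |z| > z_threshold at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(n_looks):
            # Collect another batch of observations with true mean 0, sd 1
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            # z-statistic for the running mean, known variance 1
            if abs(total / math.sqrt(n)) > z_threshold:
                hits += 1   # stop early and declare significance
                break
    return hits / n_sims

naive = false_positive_rate(1.96)      # "p < 0.05" checked at every look
stricter = false_positive_rate(2.413)  # Pocock-style constant for 5 looks
print(f"naive: {naive:.3f}  stricter: {stricter:.3f}")
```

Under these assumptions the naive rule should reject well above 5% of the time, while the stricter per-look cutoff keeps the overall rate close to 5%.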

The region where `p` hits the red line should be called "publish or perish".
There are of course statistical methods designed to support early stopping. But I don’t think you can use a regular p-test every day and decide to stop if p < 0.05. That’s something else.
You use a full two-sided ANOVA F-test with a multiple-comparison correction for that. Even these tests are sometimes not conservative enough, because the correction is a bit of a guess.

You will end up needing a much higher number of trials to hit the p-value threshold than in the version with a predetermined number of trials and no stopping on p.

Say, in a single-variable, single-run ABX test, 8 is the usual number needed according to Fisher's frequentist approach. If you do multiple comparisons to hit 0.05 you need, I believe, 21 trials instead. (Don't quote me on that, compute your own Bayesian beta prior probability.)

The number of trials to differentiate from a fair coin is the typical comparison prior, giving a beta distribution. You're trying to set up a ratio between the two of them, one fitted to your data, the other null.

The general topic and some specific ways to estimate a correction are described under this term: https://en.wikipedia.org/wiki/Sequential_analysis
Multiple comparisons and sequential hypothesis testing / early stopping aren't the same problem. There might be a way to wrangle an F test into a sequential hypothesis testing approach, but it's not obvious (to me anyway) how one would do so. In multiple comparisons each additional comparison introduces a new group with independent data; in sequential hypothesis testing each successive test adds a small amount of additional data to each group so all results are conditional. Could you elaborate or provide a link?
No, it's generally not valid -- it will depend on the specifics of the test (especially if the test is valid only asymptotically). You need some method that supports sequential inference. Nowadays your best bet is probably some sort of anytime-valid method from the e-value literature https://en.wikipedia.org/wiki/E-values https://projecteuclid.org/journals/statistical-science/volum...
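
As a rough illustration of the anytime-valid idea, here is a minimal sketch using a likelihood-ratio e-process for a coin. The alternative `p_alt = 0.6`, the 500-flip cap, and the function names are assumptions chosen for the example, not something taken from the linked references. Because the running likelihood ratio is a nonnegative martingale under the null, Ville's inequality bounds the chance that it ever reaches 1/alpha by alpha, no matter how often you look.

```python
import random

def e_process_rejects(p_true, p_alt=0.6, alpha=0.05, max_flips=500, rng=random):
    """Flip a coin up to max_flips times, checking after every single flip.

    The e-value is the running likelihood ratio of 'P(heads) = p_alt'
    against the null 'P(heads) = 0.5'. Reject the null the first time
    the e-value reaches 1 / alpha.
    """
    e_value = 1.0
    for _ in range(max_flips):
        heads = rng.random() < p_true
        # Multiply in the likelihood ratio of this one observation
        e_value *= (p_alt if heads else 1 - p_alt) / 0.5
        if e_value >= 1 / alpha:
            return True
    return False

def rejection_rate(p_true, n_sims=2000, seed=2):
    """Share of simulated experiments that reject the null at some point."""
    rng = random.Random(seed)
    return sum(e_process_rejects(p_true, rng=rng) for _ in range(n_sims)) / n_sims

print("fair coin:  ", rejection_rate(0.5))
print("biased coin:", rejection_rate(0.6))
```

The price of being allowed to peek after every flip is power: a fixed-sample test tuned to the same alternative would typically need fewer flips.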
> I had to seriously question the author's understanding of statistics at this point.

I think you may want to start the questioning closer to home.

Early stopping is fine as long as the test has been designed with the possibility of early stopping in mind and this possibility has been factored into the p-value formulation.

In lots of human studies, you can’t just stop at an arbitrary number of participants because you’ve counterbalanced manipulations to decorrelate potential confounders (e.g., which color stimulus is paired with reward, the order of trials).
The distinction is between ‘data peeking’, i.e. repeatedly checking the p-value you've obtained and stopping if it falls below 0.05, and repeating assays in the light of new information. Such new information can relate to the distribution of the values, the expected effect size, or any other parameter that you did not know at the outset of the study.

In ‘data peeking’, the flaw is that if an assay is repeated often enough, one will eventually get a result that deviates far from the mean result. This is a natural consequence of the data having a normal distribution, i.e. not all results will be identical. It's the equivalent of getting six heads or tails in a row (which should happen at least once if you flip a coin 200 times), and then reporting your coin as biased.
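
The coin-flip figure can be checked exactly rather than taken on faith. A small dynamic-programming sketch, assuming a fair coin and independent flips (`prob_run` is a name made up for this example):

```python
def prob_run(n_flips, run_length):
    """P(at least one run of `run_length` identical outcomes in n fair flips)."""
    if run_length <= 1:
        return 1.0 if n_flips >= 1 else 0.0
    if n_flips < run_length:
        return 0.0
    # state[j] = P(no qualifying run so far, current streak has length j + 1)
    state = [0.0] * (run_length - 1)
    state[0] = 1.0  # after the first flip the streak length is 1
    for _ in range(n_flips - 1):
        new = [0.0] * (run_length - 1)
        new[0] = 0.5 * sum(state)        # outcome changes, streak resets to 1
        for j in range(run_length - 2):
            new[j + 1] = 0.5 * state[j]  # outcome repeats, streak grows by 1
        # Mass in the longest tracked streak that repeats completes a run
        # and is dropped from `state` (absorbed).
        state = new
    return 1.0 - sum(state)

print(prob_run(200, 6))  # six heads or six tails in a row within 200 flips
```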

Repeating an assay because the distribution of the data is not what you thought, or because the likely difference between means is smaller than you thought is a valid approach.

Source: Big little lies: a compendium and simulation of p-hacking strategies Angelika M. Stefan and Felix D. Schönbrodt

https://royalsocietypublishing.org/doi/10.1098/rsos.220346

There is another reason to run clinical trials for as long as designed: to understand the safety and side-effect implications.
Sounds like a variable-cost experiment. Each observation costs $x. Like an A/B split on Google ads. Why keep paying for A when you already know B is better?
Small samples have more variability than large samples and thus more often show spurious large effects.
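
A quick way to see this, as a sketch under simple assumptions (normally distributed data with true mean 0 and standard deviation 1; the 0.5 threshold for a "large" effect and the function name are arbitrary choices for illustration):

```python
from statistics import NormalDist

def prob_spurious_effect(n, threshold=0.5, sd=1.0):
    """P(|sample mean| > threshold) when the true mean is 0, for sample size n."""
    standard_error = sd / n ** 0.5  # spread of the sample mean shrinks with sqrt(n)
    return 2 * (1 - NormalDist(0.0, standard_error).cdf(threshold))

for n in (10, 30, 100):
    print(n, prob_spurious_effect(n))
```

The probability of seeing an apparent effect that large by chance alone drops quickly as n grows.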
So you end up with a higher threshold for confidence at p<0.05 or whatever you want p to be under. Comes out in the maths!

Toss a coin 10 times and it comes up heads 10 times. There is a 1 in 2^10 (approx. 1 in 1000) chance of that happening with an unbiased coin.

I'm convinced it is biased.

20 times I am freaking convinced.

I don't need another 1000 tosses.

It’s more like you are supposed to toss 1000 times, and after 500 tosses you get a lucky streak of 5 heads in a row, then decide to end the experiment and conclude that the coin is biased.
Oh yeah. Don't do that! Look at all 500 tosses.
Google Optimize used to tell you to let an experiment run for one to two weeks (?), exactly because early strong results tend not to hold up in the long run.

-> https://en.wikipedia.org/wiki/Regression_toward_the_mean

Seasonality effects, too