by chkgk 1237 days ago
The study apparently only involved male participants. The control group consists of 6, the two treatment groups of 10 and 9 participants. The total N of the study is 25 participants. They conduct a one-way ANOVA on the interaction of time and treatment group indicators. I conclude that the study is woefully underpowered. I do not trust the apparent significance of their results.

I really don't understand statistics. My interpretation of "the study is under-powered" is that since the study has such small groups it will be difficult to find any result that is statistically significant. But wouldn't that mean that for any effect to be significant the effect size would have to be huge?

My hunch is that if you have a large enough group even very small effect sizes will be significant, but in small groups only the very largest effect sizes will be significant. Or am I simply bad at statistics?
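
The hunch can be checked with a back-of-the-envelope calculation. This is only a sketch under simplifying assumptions that are not from the study: two equally sized groups, a known standard deviation of 1, and a normal (z) approximation instead of a t-test. The function name is made up for illustration.

```python
from statistics import NormalDist


def smallest_significant_difference(n_per_group, sd=1.0, alpha=0.05):
    """Smallest observed difference in group means that reaches two-sided
    significance, assuming a known sd and a normal (z) approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * (2 / n_per_group) ** 0.5
    return z_crit * standard_error


# The threshold shrinks with the square root of the group size.
for n in (10, 100, 1000, 10000):
    print(n, round(smallest_significant_difference(n), 3))
```

With 10 per group the observed difference has to be close to a full standard deviation before it counts as significant; with 1000 per group a difference of under a tenth of a standard deviation is enough.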

(See bio for my background) All of the replies you've gotten so far are very good and I upvoted all of them.

In particular:

cuchoi acknowledges the publication bias version of the risk here. Let's say your average effect is 1 unit with a confidence interval of ±0.9 units at a desired level of confidence. We can interpret this confidence interval two different ways: one is that, assuming 1 unit is the true effect, repeated sampling would produce a sampling distribution of estimated effects that spans 0.1 to 1.9 at the desired level of confidence. Another is that, assuming, say, 0.1 was the true effect, effects as large as the one we see (1 unit) would occur a non-trivial portion of the time. Now, imagine many researchers do this experiment and the true effect is 0.1. Some researchers find negative effects, some find small effects that are not significant, others do larger studies and find small effects that are significant, others find larger effects. Now, imagine the journal will only publish effects that are both statistically significant and substantively interesting. The only study that gets submitted for publication is the one that finds the large effect (1 unit). cuchoi is very correct to suggest that when your design can only find large effects, the published effect will likely be overestimated.
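
A small simulation makes the "many researchers" thought experiment concrete. It is a sketch, not a model of any real literature: the true effect of 0.1 and the 0.9 interval half-width come from the example above, while the normal sampling distribution, the 95% level, and the rule "published means significantly positive" are my assumptions.

```python
import random
from statistics import NormalDist, mean


def simulate_published_effects(true_effect=0.1, ci_half_width=0.9,
                               n_studies=20000, alpha=0.05, seed=0):
    """Run many identical small studies and 'publish' only the ones whose
    estimate is positive and statistically significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = ci_half_width / z_crit  # assumed: the interval is a 95% CI
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]
    published = [e for e in estimates if e / standard_error > z_crit]
    return mean(estimates), mean(published), len(published) / n_studies


all_mean, published_mean, share = simulate_published_effects()
print("mean of all estimates:       ", round(all_mean, 3))
print("mean of published estimates: ", round(published_mean, 3))
print("share of studies published:  ", round(share, 3))
```

Averaged over all studies the estimate sits near the true 0.1, but every published estimate has to exceed the 0.9 significance threshold, so the published average is roughly ten times too large.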

fpoling and sandgiant highlight the sensitivity risk argument. Suppose that the outcome is heavily sensitive to some confounders (socioeconomic status, nutrition, smoking status, race, etc.), and suppose poor people are slightly more likely to get treatment, just from coin-flip chance. Because poverty then correlates with both the outcome and the probability of being treated (even though you tried to assign randomly), some of the visible effect is the relationship between poverty and outcome, not treatment and outcome. There are designs other than simple randomization that try to explicitly deal with known confounders, but they can't deal with unknown confounders. Larger sample sizes mitigate the risk of imbalance of both known and unknown confounders.
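
To see how much chance imbalance simple randomization leaves behind, here is a rough simulation. The 30% prevalence of the confounder, the even split into two arms, and the function name are all assumptions for illustration only.

```python
import random


def average_imbalance(n_participants, prevalence=0.3, n_trials=1000, seed=0):
    """Average absolute gap between arms in the share of participants who
    carry a binary confounder, under simple random assignment."""
    rng = random.Random(seed)
    half = n_participants // 2
    total_gap = 0.0
    for _ in range(n_trials):
        # 1 = participant has the confounder (e.g. is poor), 0 = does not
        has_trait = [1 if rng.random() < prevalence else 0
                     for _ in range(n_participants)]
        rng.shuffle(has_trait)  # random assignment: first half gets treatment
        treated, control = has_trait[:half], has_trait[half:]
        total_gap += abs(sum(treated) / len(treated) - sum(control) / len(control))
    return total_gap / n_trials


for n in (20, 1000):
    print(n, round(average_imbalance(n), 3))
```

With 20 participants the two arms typically differ by well over ten percentage points on the confounder purely by chance; with 1000 the gap is a few points. Nothing in the code knows which trait it is, which is why the same argument covers unknown confounders.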

Everyone is doing great!

Under-powered also means that the minimum detectable effect is high (that's why it is harder to get a "significant" result).

Which means that you are likely to find an effect only if you are overestimating the real effect. The real effect might not even be detectable!
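
For a rough feel of how high the minimum detectable effect gets, here is the textbook approximation for comparing two group means. The group sizes 10 and 6 are the ones quoted at the top of the thread; the 80% power target, the 5% level, and the normal approximation (which is optimistic for groups this small) are my assumptions.

```python
from statistics import NormalDist


def minimum_detectable_effect(n_treated, n_control, sd=1.0, alpha=0.05, power=0.8):
    """Approximate true difference in means (in the units of sd) that a
    two-sided test would detect with the requested power (z approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    standard_error = sd * (1 / n_treated + 1 / n_control) ** 0.5
    return (z_alpha + z_power) * standard_error


print(round(minimum_detectable_effect(10, 6), 2))      # one treatment arm vs. control
print(round(minimum_detectable_effect(1000, 1000), 2)) # a large study for comparison
```

Under these assumptions the small comparison needs a true effect of well over one standard deviation to be detected reliably, so anything smaller mostly shows up as "significant" only when noise inflates it.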

There might be someone able to better explain this, but one way it could work would be to say that for small samples any bad assumptions you make (such as variables or measurements being independent) will affect the result more than if you had a larger sample.

The assumption I make to make this work is that dependencies are more likely to be drowned out by internal variation in a larger sample. So you get to pick which assumption you like better.

A simple rule of thumb is that for samples with N < 100 you should be very skeptical of the results, as those can be achieved simply by randomness on top of small systematic errors. Proper statistics helps to rule out randomness, but not systematic errors. Which pretty much rules out most of the sport studies.
I studied this many years ago, but you have formulas for survey sizes based on confidence levels and, if I recall correctly, the number of variables you want to study.

People publishing these studies should know and use these formulas, but I imagine there's a lot of pressure to publish high impact/visibility stuff so they just go for the cheapest and fastest (aka wrong) approaches sometimes.
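
One such formula, as far as I can tell, is the standard sample size for estimating a proportion in a survey. It may not be the exact one the comment remembers: it depends on the confidence level and the margin of error, not on the number of variables, and the defaults below (95% confidence, 5-point margin, worst-case proportion of 0.5) are my choices.

```python
import math
from statistics import NormalDist


def survey_sample_size(margin_of_error=0.05, confidence=0.95, proportion=0.5):
    """Respondents needed to estimate a proportion within +/- margin_of_error,
    assuming simple random sampling from a large population."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)


print(survey_sample_size())                      # 5-point margin
print(survey_sample_size(margin_of_error=0.01))  # 1-point margin
```

Tightening the margin from 5 points to 1 point multiplies the required sample by about 25, which is the kind of cost that tempts people toward small studies.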

short answer: it's complex and there are books on the topic.

less disappointing answer:

You have a hypothesis how STUFF works differently when you make an intervention (experiment, i.e. collect data, change something or go to the control group, collect more data).

Your default assumption is that your experiment won't show a meaningful difference, OR it could show a difference (positive/negative). Now what you observe may not be the reality. Which leaves you with 4 possible situations:

False-positive, true-positive, false-negative, true-negative

Most statistical methods used in data analysis take great care to limit the probability of a false positive (the probability that our method yields 'positive' when in fact there is no effect in reality). This is what the famous 'p-value' is about: strictly, the p-value is computed from your data, and the false-positive probability is the threshold you compare it against, although 'p-value' is sometimes used loosely for that threshold as well.

So when you run a statistical test, you get a p-value and compare it to a threshold, for example p < 5%. This means you accept that only about every 20th experiment in which there is in reality no effect will result in a 'significant' finding (i.e. a false positive).

So naively increasing your sample size will not lower your false-positive probability, because your analysis method already holds it at the threshold. However, the sample size strongly influences the false-negative rate: a Student's t-test with p < 0.05 and sample size N=3 will still yield a false positive with 5% probability, but in practice there is only a slim chance of getting a true-positive result.
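
The N=3 claim is easy to check by simulation. This sketch assumes 3 participants per group, normally distributed data with equal variances, and a true effect of one standard deviation for the "real effect" case; none of that comes from the study. The critical value 2.776 is the tabulated two-sided 5% value for Student's t with 4 degrees of freedom.

```python
import random

N_PER_GROUP = 3
T_CRIT_DF4 = 2.776  # two-sided 5% critical value, t distribution, 4 degrees of freedom


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def rejection_rate(true_difference, n_experiments=20000, seed=0):
    """Share of simulated two-sample t-tests (pooled variance) that come out
    'significant' at the 5% level for a given true difference in means."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(true_difference, 1.0) for _ in range(N_PER_GROUP)]
        mean_c, var_c = mean_and_variance(control)
        mean_t, var_t = mean_and_variance(treated)
        pooled_variance = (var_c + var_t) / 2
        t_stat = (mean_t - mean_c) / (pooled_variance * 2 / N_PER_GROUP) ** 0.5
        if abs(t_stat) > T_CRIT_DF4:
            rejections += 1
    return rejections / n_experiments


print("false-positive rate (no true effect):", rejection_rate(0.0))
print("power (true effect of 1 sd):         ", rejection_rate(1.0))
```

The first number stays near 5% no matter how small the groups are, while the second shows that even a large true effect is missed most of the time at this sample size.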

From this perspective, the criticism here about sample size does not make too much sense. However, we need to keep in mind:

A) There is a whole field of problems about controlling variables (i.e. adding more columns to your data table). Each variable adds another dimension to your problem, and this quickly leads to a 'curse of dimensionality' problem. Is the observed effect explained by your experimental intervention, or by differences between your control group and your study subjects (sex/gender, socioeconomic status, age, training level, overall health)? Not being able to control for a variable can quickly lead to false-positive results.

B) Complexity of the method at play. The study uses ANOVA (analysis of variance). It's been years since I last looked at it, so I am not making any statements here.

C) Crucially: many methods actually assume normally distributed data (Gaussian distribution). However, collected data is rarely normally distributed. One can still use methods for normally distributed data on non-normal data because of the central limit theorem (often confused with the "law of large numbers"), i.e. averages over many non-normally distributed observations tend to end up approximately normally distributed. But this does not reliably happen at N=10.
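
Point C can be illustrated with a quick simulation. It draws from an exponential distribution, which is my stand-in for "skewed real-world data", and measures how lopsided the distribution of sample averages still is at two sample sizes. The sizes 10 and 200 and the function names are illustrative choices.

```python
import random


def skewness(values):
    """Sample skewness: 0 for a symmetric distribution such as the normal."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n


def skewness_of_sample_means(sample_size, n_samples=4000, seed=0):
    """Skewness of the averages of many samples from an exponential distribution."""
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
             for _ in range(n_samples)]
    return skewness(means)


for size in (10, 200):
    print(size, round(skewness_of_sample_means(size), 2))
```

At a sample size of 10 the averages are still clearly skewed, so the normal approximation is shaky; by 200 most of the skew is gone.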

There are a few finer points to mention here, which is that many HN commenters have a machine-learning background and may be a bit biased against smaller-sample-size studies for multiple reasons that are specific to what they are used to in the machine-learning world. And on the other hand, from my experience majoring in biophysics, many health-related studies on sports and obesity really have low-quality stats and overestimate the predictive power of their datasets.

tl;dr: I would only conclude from this study that HIIT is better than nothing, not that it is better or worse than other cardio exercise.

PS: The above text tries to break down complex stuff and thereby by definition contains mistakes.

Every time I see a study with a big headline, it almost always has double digit participants.

Very callous behavior.

First studies with new findings should be small, surely, so we can weed out effects to test with larger studies. The question is whether the larger studies are being done? The incentives are such that democratic governments need to be leading the research, IMO.

How do we ensure at a national level (because studies probably need to be repeated in different nations) we're doing good science, backing up key results, informing the population what the better ways of behaving are?

Sports science is probably the most unintentionally hilarious branch of statistics.
Everything around nutrition in general... But yeah, with sports you're throwing in groups of insanely driven outliers. Even more fun is mapping that back onto the average human like much of this post is doing. No, you should not try to follow Michael Phelps' pre-Olympics training routine.
I'll do you one better: all of the subjects were untrained. Any kind of training will improve fitness in virtually every single marker that can be measured. Your aerobic fitness will improve from lifting heavy weights. You'll get stronger from running. These studies are totally useless.