by retroam 3925 days ago
Wish the study was not behind a paywall but...5 human subjects?

> Wish the study was not behind a paywall but...5 human subjects?

It's an experimental study, the number of subjects isn't really that important. I can't access the full text, either, but I assume they did an ABAC test pattern (control, treatment caffeine, control, treatment caffeine + bright light) or something similar with all 5 subjects simultaneously.

Generally speaking, you really only need many participants for field studies, i.e. situations where you cannot control most variables besides the treatment itself. The assumption is that the law of large numbers takes care of distributing those confounding variables equally between the treatment group and the control group.

"It's an experimental study, the number of subjects isn't really that important."

I'd be interested to understand why this is. My logical reaction would be that it's always important - as a crude example, surely doing an experiment on every single human on Earth would give you much more accurate results than on, say, 100 people, because you'd be sure to have covered all the innate variables that exist when experimenting with humans (different metabolisms, etc.)?

Depends on what kind of generality of statement you're trying to make. Here, the generality might be a cause-and-effect one, in which case you're attempting to generalize over possible future treatments, and you attack detractors who might yell "that was a fluke!" or "it wasn't the caffeine, but instead the presence of the doctor!". To do this, you design an experiment which carefully controls for all expected irrelevant interactions and then show a response which is significantly different from random variation.

You end up limited, as you note, to your population. 5 people won't defeat detractors who believe that this effect is limited within some, e.g., metabolic profile but it ought to give them serious food for thought as to how wide the affected metabolic profile actually is.

If these 5 volunteers were chosen at random, then the potential generality of effect can still be large as a detractor would be fighting, at best, with the notion that the 5 chosen were circumstantially susceptible to this effect (as compared to a study of convenience where one might believe that "college students" or "hospital volunteers" are especially susceptible).

So, in a certain sense, testing every human on earth improves the power of the statement you can make (not really its "accuracy" though maybe its "precision", in a sense), but in many other ways that may be too expensive for the kind of result the author seeks.

I have a feeling this test is not so much about accuracy as it is about tendency. I.e. they avoid specific projections entirely; instead they want to see if the effect can be observed even once and thus have groundwork for further investigation.

> "It's an experimental study, the number of subjects isn't really that important."

I'd be interested to understand why this is?

I replied to cossatot below in more detail. The short version: in studies like this one, N isn't 5, but humans (i.e. the original N) x treatment repetitions x measurement points.

You are right that it would be dangerous to ignore it; that wasn't what I implied.

It'd be completely accurate, you would just need to keep every single human on Earth in controlled conditions. Good luck!

No, this is wrong. With small sample sizes you may get a statistically significant result, but it still might not be a real result and might not be reproducible. This is a major issue in science today and why a lot of studies can't be replicated.

> No, this is wrong. With small sample sizes you may get a statistically significant result, but it still might not be a real result and might not be reproducible. This is a major issue in science today and why a lot of studies can't be replicated.

Reproducibility is indeed a major problem, but looking at statistical significance alone isn't the cure (especially if applied a posteriori).

We should rather look at effect sizes and robust study designs.

In fact, modern studies aiming for causality often calculate the population size needed for statistical significance beforehand. It's a standard formula in most textbooks. You only need the expected effect size and then can calculate the population needed to guarantee significance.
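As a rough sketch of the kind of textbook calculation meant here (not necessarily the one any given study used): the usual normal-approximation formula for comparing two group means takes the expected standardized effect size, a significance level and a target power, and returns a per-group sample size. The function name and the defaults (alpha 0.05, power 0.8) are assumptions for illustration. Note that the result is tied to a chosen power, so it gives a probability of reaching significance rather than a certainty.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the expected standardized effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical effect sizes, from small to very large.
    for d in (0.2, 0.5, 1.0, 2.0):
        print(f"d = {d}: about {sample_size_per_group(d)} subjects per group")
```

The required n drops with the square of the effect size, which is why a large expected effect can justify a small study under this formula.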

Statistically significant means statistically significant and is independent of sample size. If your p-value is less than 0.01, then there's less than a 1% chance that the pattern you're seeing is due to random fluctuations of the variable itself that you cannot predict.

The problem is that the statistical model (in my field we do a lot of ANOVA and t-tests, along with the occasional chi-square) can only account for what you model. So there could be some kind of systematic error that influences your results in a fashion that is not modeled by the statistics. Having a large-N study makes it harder to have that systematic error (but not impossible - as an example: look at complaints about how much psychological and cognitive science research is only on WEIRD subjects - western, educated, industrialized, rich, democratic).

The other problem, of course, is that one time in a hundred, you'll get a p < 0.01 significant result by chance. Which is a lot in the long run. Worse, you can induce type I errors (false positives) by running hundreds of trials (or testing hundreds of variables) and not accounting for that - just pick the one thing that had significant results on a single test. This approach is unscrupulous, but not unheard of in academic circles where you need to publish tons of work to get promoted.
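To put a number on the "hundreds of trials" point, here is a minimal sketch. It assumes the tests are independent and every null hypothesis is true, which is a simplification. The function names are made up for illustration; the Bonferroni correction shown is one common way of accounting for multiple tests, not the only one.

```python
def familywise_error_rate(num_tests, alpha=0.01):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.01):
    """Per-test threshold that keeps the overall false-positive rate
    at or below alpha (Bonferroni correction)."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 500):
        rate = familywise_error_rate(m)
        cutoff = bonferroni_threshold(m)
        print(f"{m} tests: P(at least one fluke) = {rate:.3f}, "
              f"corrected per-test cutoff = {cutoff:.5f}")
```

With 100 tests at p < 0.01 the chance of at least one fluke is already around 63% under these assumptions.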

> If your p-value is less than 0.01, then there's less than a 1% chance that the pattern you're seeing is due to random fluctuations of the variable itself that you cannot predict.

This is a dangerous misinterpretation of p values, which cannot provide that kind of information. A p value assumes the pattern is due to random fluctuations, and asks how common this kind of fluctuation is.

Typically the chance the result is a random fluctuation is much higher; for examples, see http://www.statisticsdonewrong.com/p-value.html

That's actually a more articulate, but redundant codicil to the argument I made in the rest of the post. Multiple tests will result in significance at some alpha, since you just have to test enough times to get a lucky test. There are techniques (outlined in your link), for addressing that, but the central point I think is still cogent.

If you have a test of significance that results in p < 0.01, there's a one percent chance that you're rejecting the null hypothesis due to normally-distributed variation in your data. The base rate fallacy is more about interpreting what that p = 0.01 means, and why systematic bias is important to worry about - if you're testing cancer drugs, you don't want to test them on people who don't have cancer.

> If you have a test of significance that results in p < 0.01, there's a one percent chance that you're rejecting the null hypothesis due to normally-distributed variation in your data.

No, this is absolutely not true. If p < 0.01, then if there is no systematic effect and only normally-distributed variation, you would see this effect 1% of the time. That is, the p is P(data | null is true), and not P(null is true | data). You cannot invert the conditional.

In the extreme case, when the null is true for every test, you will get significant results for 5% of them (at a 0.05 threshold). Thus 100% of your statistically significant results are false positives, no matter how small their p values.

Given that we do not know what fraction of the time the null is true, we cannot know the chance that we're rejecting the null falsely. But it is invariably larger than p.
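A small sketch of why the conditional can't be inverted: the share of significant results that are false positives depends on how often the null is true to begin with and on the power of the test, neither of which the p value tells you. The function name and the example inputs (10% of tested hypotheses real, 80% power) are hypothetical.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Fraction of statistically significant results that are false positives.

    prior_true: fraction of tested hypotheses where a real effect exists
    power:      chance of detecting a real effect when there is one
    alpha:      significance threshold, i.e. P(significant | null is true)
    """
    true_positives = prior_true * power
    false_positives = (1 - prior_true) * alpha
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no significant results are possible with these inputs")
    return false_positives / total


if __name__ == "__main__":
    # Extreme case from above: the null is true for every test.
    print(false_positive_share(prior_true=0.0, power=0.8))
    # Hypothetical field where 1 in 10 tested effects is real.
    print(round(false_positive_share(prior_true=0.1, power=0.8), 2))
```

In the second case roughly a third of the significant results are flukes even though every one of them passed p < 0.05.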

This misunderstanding is why scientists routinely overestimate the strength of their evidence and discount the possibility that their results may be flukes.

(Source: I wrote the link provided earlier. Also, the discussion leading to table 1 in this paper is good http://journals.plos.org/plosmedicine/article?id=10.1371/jou...)

This isn't correct. The statistical power of n=5 humans is quite low.

It is, however, a good example of the "law of small numbers" of Tversky and Kahneman, a cognitive bias in which people believe that the law of large numbers applies to small numbers as well.

See Tversky and Kahneman 1971, or Kahneman's fantastic recent book Thinking, Fast and Slow which is an excellent guide to how our cognitive biases can wrongly influence our thinking.

> This isn't correct. The statistical power of n=5 humans is quite low.

A few points are important to consider.

First, I was only talking about experimental studies searching for causal relationships. There are other possible designs, for example field studies (e.g. "school district A gets the new math curriculum, school district B the old one. Which one fares better?") or simple population observations ("people playing golf live longer than the average population."). Each design has advantages and disadvantages regarding generality of the statement one can make, and for each one different statistical considerations apply.

Second, the statistical power does not rely on a high population alone, as that (more or less) only affects the significance tests. Much more important is the effect size. If you can measure a large effect (as this study did), it's pretty hard not to reach significance anyway.
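To illustrate how effect size and sample size trade off, here is a rough sketch of the power of a two-sided test on within-subject differences, using a normal approximation. With n as small as 5 a real t-test would have somewhat less power than this suggests, so treat the numbers as optimistic. The function name and the effect sizes are hypothetical, not taken from the study.

```python
import math
from statistics import NormalDist


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided z-test on n paired differences.

    effect_size is the mean difference divided by the standard deviation
    of the differences. Normal approximation, so optimistic for small n.
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)  # expected test statistic under the effect
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    for d in (0.2, 0.5, 1.0, 2.0):
        print(f"effect size {d}, n = 5: power is roughly {approx_power(d, 5):.2f}")
```

Under this approximation, n = 5 is badly underpowered for small and medium effects but not for very large ones.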

Third, from a statistical point of view, the population isn't 5, but much higher.

Let me explain: There are certain kinds of treatments whose effect is reversible. Caffeine intake is a good example: Once you stop taking caffeine, the effect recedes. While designing the study, you can use that property. One common way is an ABAB design, where A is a phase with treatment and B is a phase without. You can chain as many AB pairs as time permits, and additionally you can measure multiple times per phase. Statistically, the population now is real_humans x number_of_phases x measure_points_per_phase.
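A quick sketch of that count, enumerating every measurement in an ABAB schedule. Only the 5 subjects come from the article; the number of phases and the 3 measurement points per phase are hypothetical, and the function name is made up for illustration.

```python
from itertools import product


def abab_observations(num_subjects, phase_pattern="ABAB", points_per_phase=3):
    """List every (subject, phase_index, phase_label, measurement) tuple.

    phase_pattern: one letter per phase, e.g. "ABAB" for two AB pairs.
    """
    subjects = range(1, num_subjects + 1)
    phases = list(enumerate(phase_pattern))
    points = range(1, points_per_phase + 1)
    return [
        (subject, phase_index, label, point)
        for subject, (phase_index, label), point in product(subjects, phases, points)
    ]


if __name__ == "__main__":
    observations = abab_observations(num_subjects=5)
    # real_humans x number_of_phases x measure_points_per_phase
    print(len(observations))  # 5 * 4 * 3 = 60
```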

If the 5 subjects capture all the relevant differences among humans, then sure. But, for example, what if they inadvertently selected only heavy coffee drinkers? That being said, I think the strength of their paper might be the molecular experiments they did on cultured cells. The media articles lead with the human experiments because those might be more relatable.

There's the problem that with 5 events you can't know if it was a fluke, nor which subpopulation you are sampling from.

Consider, for instance, how the drug Naltrexone seems to work very well for treating alcoholism in Asians and poorly in Blacks. If you don't take this into account, whatever result you get is going to make the drug look either more or less effective than it is.

In the part that can be accessed (2nd link) they mention it's a within-subject setup, so yeah, "with all 5 subjects simultaneously" is correct. Within-subject basically means each subject gets all the treatments, as opposed to between-subject, which is the typical A/B-test setup.

I'm quite aware of P-values :) I neither mentioned posterior calculation of probabilities nor talked about correlation studies, so I'm not seeing your point in linking to Gelman's article?