by chopete 2927 days ago
>> For a problem to be, in a statistical sense, causally identified there must be some random or as-if random manipulation of treatment.

It would be good to see your ideal example of a causal analysis, so that people without a statistics background can follow your long note better.

You take a random sample of 1,000 white men between the ages of 45 and 55, who have lived in New England for at least 10 years, with no known history of heart disease. You randomly split them in half. You give half of them a supplement to take every day for 12 months, and you give the other half a placebo. If the number of heart attacks in the placebo sample is greater than in the treatment sample, you have some believable evidence that the supplement can help prevent unexpected heart attacks, at least in white men in their 40s and 50s.

The idea is that you've controlled for just about every factor that could affect the rate of unexpected heart attacks, or those factors are evenly distributed throughout both samples because you were careful to sample randomly. Therefore, if there is a difference between the groups, on average, it must be because of the treatment that you introduced to one group and not the other.
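A minimal sketch of that design as a simulation, in case it helps to see it run. The risk numbers (`base_risk`, `treated_risk`) and the function name `simulate_trial` are made up for illustration and are not from any real study; the only point is that random assignment plus a difference in event rates gives the estimate.

```python
import random

def simulate_trial(n=1000, base_risk=0.10, treated_risk=0.06, seed=0):
    """Simulate a two-arm randomized trial; return the heart-attack rate per arm.

    base_risk and treated_risk are hypothetical 12-month risks.
    """
    rng = random.Random(seed)
    subjects = list(range(n))
    rng.shuffle(subjects)                  # random split: first half gets the supplement
    half = n // 2
    treated = set(subjects[:half])

    events = {"treatment": 0, "placebo": 0}
    for s in range(n):
        arm = "treatment" if s in treated else "placebo"
        risk = treated_risk if arm == "treatment" else base_risk
        if rng.random() < risk:            # did this subject have a heart attack?
            events[arm] += 1

    return {
        "treatment": events["treatment"] / half,
        "placebo": events["placebo"] / (n - half),
    }

rates = simulate_trial()
# Difference in event rates between arms is the estimated effect of the supplement.
effect_estimate = rates["treatment"] - rates["placebo"]
print(rates, effect_estimate)
```

With only 500 subjects per arm, a single run can be noisy; rerunning with different seeds shows how much the estimate moves by chance alone.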

I'm hand-waving, of course, and I'm sure there are medical researchers out there who will read my study design and laugh at how badly controlled it is. But that should give you the general picture of one common method used to perform "causal" analysis.

Great example. In the design you propose, in expectation, you would have an unbiased causal inference. We would probably want to check for pre-treatment balance between groups to make sure that stochastic (chance) imbalance did not emerge even though the process itself is good. I don't know anything about heart attacks so I don't have the subject matter knowledge here, but imagine that smoking causes heart attacks. If that's the case, although your design should not cause the presence of smokers among treated and control units to systematically vary, maybe it did by chance. We'd want to assess balance. Same with any other potential confounders.
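One common way to assess balance on a covariate like smoking is the standardized mean difference (SMD) between arms. The sketch below assumes a hypothetical 20% smoking rate and simulated data; the helper name `standardized_mean_difference` is illustrative. A frequently quoted rule of thumb treats an absolute SMD below roughly 0.1 as acceptable, but that threshold is a convention, not a law.

```python
import math
import random
from statistics import mean, pvariance

def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    diff = mean(treated) - mean(control)
    pooled_sd = math.sqrt((pvariance(treated) + pvariance(control)) / 2)
    if pooled_sd == 0:
        # No variation in either arm: balanced if the means agree, else maximally imbalanced.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd

rng = random.Random(1)
# Hypothetical pre-treatment covariate: 1 = smoker, 0 = non-smoker.
smoker = [1 if rng.random() < 0.2 else 0 for _ in range(1000)]

order = list(range(1000))
rng.shuffle(order)                          # simple (unblocked) random assignment
treated_smoke = [smoker[i] for i in order[:500]]
control_smoke = [smoker[i] for i in order[500:]]

smd = standardized_mean_difference(treated_smoke, control_smoke)
print(f"smoking SMD between arms: {smd:.3f}")
```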

Another technique we might use is a blocked (or stratified) random sample. Knowing that there will be both smokers and non-smokers, we recruit two separate samples, and randomize treatment assignment within each. This ensures that smoking status does not predict treatment assignment and guards against some potential threat from overall randomization.
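A rough sketch of blocked assignment, assuming each subject is a dict with hypothetical `id` and `smoker` fields. Treatment is randomized separately inside each block, so smokers and non-smokers are each split evenly between arms by construction.

```python
import random
from collections import defaultdict

def block_randomize(units, block_key, seed=0):
    """Randomize treatment within each block; return {unit id: arm}."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for u in units:
        blocks[block_key(u)].append(u)      # group units by block (e.g. smoking status)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)                # random order within the block
        half = len(members) // 2
        for i, u in enumerate(members):
            assignment[u["id"]] = "treatment" if i < half else "placebo"
    return assignment

# Hypothetical sample: every fifth subject is a smoker (20%).
units = [{"id": i, "smoker": i % 5 == 0} for i in range(1000)]
assignment = block_randomize(units, block_key=lambda u: u["smoker"])

for status in (True, False):
    ids = [u["id"] for u in units if u["smoker"] == status]
    n_treated = sum(assignment[i] == "treatment" for i in ids)
    print(f"smoker={status}: {n_treated} of {len(ids)} treated")
```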

We could also mitigate the imbalance that does exist by doing a matched analysis, where each treated unit is paired with the control unit that looks most like him (some control units may be reused). Or we could match on propensity scores. Or we could weight by inverse propensity scores. Or we could weight using covariate balancing. Or...
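To illustrate just one of those options, here is a sketch of inverse propensity weighting on a tiny constructed dataset. Everything in it is hypothetical: smokers are treated more often than non-smokers, the outcome is built as `2 * smoker + 1 * treated` (so the true effect is 1 by construction), and the propensity score is simply the treated fraction within each smoking stratum rather than a fitted model.

```python
from collections import defaultdict
from statistics import mean

def ipw_effect(rows):
    """Normalized (Hajek-style) IPW estimate; rows are (stratum, treated, outcome).

    Assumes positivity: both arms are present in the data.
    """
    n = defaultdict(int)
    n_treated = defaultdict(int)
    for stratum, treated, _ in rows:
        n[stratum] += 1
        n_treated[stratum] += treated
    propensity = {s: n_treated[s] / n[s] for s in n}    # P(treated | stratum)

    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}               # arm -> [sum of w*y, sum of w]
    for stratum, treated, y in rows:
        e = propensity[stratum]
        w = 1 / e if treated else 1 / (1 - e)           # inverse probability of the arm received
        sums[treated][0] += w * y
        sums[treated][1] += w
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]

# Constructed data: smokers are mostly treated, non-smokers mostly untreated.
rows = (
    [("smoker", 1, 3.0)] * 8 + [("smoker", 0, 2.0)] * 2
    + [("nonsmoker", 1, 1.0)] * 2 + [("nonsmoker", 0, 0.0)] * 8
)

naive = mean(y for _, t, y in rows if t) - mean(y for _, t, y in rows if not t)
print(f"naive difference: {naive:.2f}, IPW estimate: {ipw_effect(rows):.2f}")
```

In this toy data the naive difference is inflated because smoking drives both treatment and outcome, while the weighted estimate lands near the built-in effect of 1. Real data would need a modeled propensity score and some care with extreme weights.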

My point in doing this info dump is to a) back up nerdponx's example, which is great and b) illustrate how there's a lot to learn about how statisticians have taken the problem of causal analysis seriously and developed techniques appropriate for answering causal questions.

People on the CS side of things tend to use Pearl's DAGs for conceptualizing this stuff. I'm on the stats/econ side of things so I use Neyman-Rubin. They're equivalent. Allow me to suggest Imbens and Rubin, "Causal Inference for Statistics, Social, and Biomedical Sciences", as a good textbook that we assign to graduate students learning this stuff. Some of my students tell me the "Causal Inference Mixtape" is popular among people who want less statistical theory and more "what should I do as a practitioner". A virtue of both the resources I just mentioned is that they discuss not just experimental designs but also observational data studies, like the one the original post would have wanted to conduct.