by AnthonyMouse 2671 days ago
The trouble is that you can't fix large numbers of statistical confounders with more data, because there is a limit to how many factors you can control for before the measurement error overwhelms the signal.

To do statistical controls, you essentially sort the data by category, so that you're not just comparing black people with white people, you're comparing middle class 18 year old black female college applicants with college educated parents to middle class 18 year old white female college applicants with college educated parents.

But every one of those factors is a chance to have measured something wrong. Your group of middle class 18 year old black female college applicants with college educated parents will have a couple of people who were misidentified as middle class, a couple of people who were misidentified as black, a couple of people who were misidentified as female, a couple of people who were misidentified as 18 and a couple of people who were misidentified as having college educated parents. And those errors don't cancel out, precisely because the primary factor was correlated with each confounder to begin with, so the measurement error compounds in proportion to the strength of the correlation of the primary factor with each confounder.

Meanwhile, the size of each subcategory shrinks each time you bisect it further. So the more things you try to control for, the larger the share of each subcategory's sample that is measurement error.
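
To make that concrete, here is a rough simulation sketch. Every number in it is an assumption chosen for illustration (a 0.7 vs 0.3 prevalence of each confounder across the two groups, a 10% misclassification rate, unit effect per confounder, and the hypothetical function name `adjusted_gap`). The group has no direct effect on the outcome, so a perfect adjustment should report a gap near zero.

```python
import random
from collections import defaultdict
from statistics import mean

def adjusted_gap(k, flip_prob, n=40000, seed=0):
    """Stratify on k *measured* binary confounders and return
    (size-weighted within-stratum gap between groups, number of strata)."""
    rng = random.Random(seed)
    # stratum -> (outcomes for group 0, outcomes for group 1)
    cells = defaultdict(lambda: ([], []))
    for _ in range(n):
        g = rng.random() < 0.5
        p = 0.7 if g else 0.3                       # confounders correlate with group
        true_c = [rng.random() < p for _ in range(k)]
        # each recorded value is flipped with probability flip_prob
        measured = tuple(c != (rng.random() < flip_prob) for c in true_c)
        y = sum(true_c) + rng.gauss(0, 1)           # group itself has NO effect
        cells[measured][int(g)].append(y)
    total, weight = 0.0, 0
    for y0, y1 in cells.values():
        if y0 and y1:                               # skip strata missing a group
            w = len(y0) + len(y1)
            total += w * (mean(y1) - mean(y0))
            weight += w
    return total / weight, len(cells)

if __name__ == "__main__":
    for k in (1, 2, 3):
        clean, _ = adjusted_gap(k, 0.0)
        noisy, strata = adjusted_gap(k, 0.1)
        print(f"k={k} strata={strata} gap(clean)={clean:.3f} gap(noisy)={noisy:.3f}")
```

Under these assumptions the leftover gap with noisy labels grows with each added control, while the number of strata doubles and each cell gets smaller. It is only a toy model, not a claim about any real dataset.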

3 comments

I hate to take “both sides” but in the absence of confounding by indication, you can often use propensity scoring within robust models to decrease these impacts.

Mind you, the problem with non-random and undetected sampling bias is that it can be subtle. See for example https://www.nytimes.com/2018/08/06/upshot/employer-wellness-...

Propensity scoring is a method of applying statistical controls. How does it address the issue of controls compounding measurement error?

That’s the whole point of doubly robust models. However, in the event of confounding by indication or sampling misspecification, my experience is that nothing can save you.

I am a rather strong proponent of randomized trials for this exact reason. (They can also have sampling bias, but some degree of noise is inevitable.)

Even if you had infinite data, you are not allowed to just control for everything you measured. You still need to bring in your causal knowledge. E.g. you probably shouldn't control for body weight if it was measured a month after the treatment.
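
A small simulation sketch of the body weight example, under assumptions I made up for illustration: treatment `t` is randomized, an unmeasured health factor `u` drives both later weight `w` and the outcome `y`, treatment lowers `w`, and the true effect of treatment on `y` is 1.0. The function name `simulate` and all coefficients are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate(n=40000, seed=1, bin_width=0.5):
    """Return (unadjusted gap, gap after stratifying on post-treatment weight)."""
    rng = random.Random(seed)
    by_t = ([], [])
    # weight bin -> (outcomes for untreated, outcomes for treated)
    cells = defaultdict(lambda: ([], []))
    for _ in range(n):
        t = int(rng.random() < 0.5)               # randomized treatment
        u = rng.gauss(0, 1)                       # unmeasured health factor
        w = -2.0 * t + u + rng.gauss(0, 1)        # weight measured AFTER treatment
        y = 1.0 * t + 2.0 * u + rng.gauss(0, 1)   # true treatment effect is 1.0
        by_t[t].append(y)
        cells[round(w / bin_width)][t].append(y)
    unadjusted = mean(by_t[1]) - mean(by_t[0])
    total, weight = 0.0, 0
    for y0, y1 in cells.values():
        if y0 and y1:                             # skip bins missing a group
            size = len(y0) + len(y1)
            total += size * (mean(y1) - mean(y0))
            weight += size
    return unadjusted, total / weight

if __name__ == "__main__":
    unadjusted, adjusted = simulate()
    print(f"unadjusted={unadjusted:.2f} adjusted_for_weight={adjusted:.2f}")
```

In this toy setup the plain difference in means recovers the true effect, while "controlling" for weight pushes the estimate well away from it, because comparing people at the same later weight implicitly compares people with different values of `u`.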

The point is, if you use your causal knowledge in a smart way, you can also draw strong conclusions from just observational data.

Lots of practical challenges for sure!