by sopooneo 2671 days ago
In simple cases at least, such as with the kidney stones, can we reduce our risk of reaching wrong conclusions by increasing our sample size of patients and randomizing which receive each treatment?

Yes, but it won't help with other problems like measuring the wrong metric.

For example, the YouTube latency example linked at the bottom was a randomized A/B test ("launched an opt-in to a fraction of our traffic"), but it was measuring per-user latency metrics when the distribution of 'user' had changed radically thanks to the improvements; for this, he would've needed to instead be monitoring some more global long-term effect like user retention or total traffic (then he would've seen a result like 'latency got a lot worse, but we're getting a ton more users and they're coming back much more frequently, so, that's good overall but why is latency up and who are all these new users...? aha!'). You have a Simpson's paradox on the level of metrics here, instead of individuals.
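
To make the mix-shift concrete, here is a toy sketch. Every number in it is invented for illustration (it is not YouTube's data), and the segment names `fast_network` and `slow_network` are hypothetical. It assumes the lighter page made the site usable for a new group of users on slow connections.

```python
# Invented numbers for illustration only; not real traffic data.
# Each segment maps to (number of users, mean page latency in seconds).
before = {
    "fast_network": (1000, 2.0),   # assumption: slow-network users could not load the page at all
}
after = {
    "fast_network": (1000, 1.0),   # existing users got faster
    "slow_network": (500, 20.0),   # new users who can now load the lighter page
}


def user_count(segments):
    """Total number of users across all segments."""
    return sum(n for n, _ in segments.values())


def mean_latency(segments):
    """User-weighted mean latency across all segments."""
    return sum(n * lat for n, lat in segments.values()) / user_count(segments)


print("mean latency before:", mean_latency(before))
print("mean latency after: ", mean_latency(after))
print("users before/after: ", user_count(before), user_count(after))
```

The average gets worse even though no individual user got slower, because the population being averaged over changed. A global metric like total users shows the improvement.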

Yes absolutely! Random assignment along with statistical power and significance considerations does indeed allow one to draw causal conclusions. It’s the gold standard for causal inference.
> Yes absolutely!

The problem with these cases is generally that people want to use data that didn't come from a controlled experiment to begin with. You have a nice, fat data set of all the people who have been treated for kidney stones -- you could never afford to do a controlled experiment at that scale. But because the treatments weren't randomized (and neither was anything else), the conclusions are erroneous.
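
For anyone who wants to see the reversal in numbers, here is a small sketch. It assumes the figures usually quoted for the kidney-stone example (350 patients per treatment, split by stone size); the treatment labels `A` and `B` and the dictionary layout are just my own naming.

```python
# (successes, patients) per treatment and stone size.
# Assumption: these are the figures commonly quoted for the kidney-stone example.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def stratum_rate(treatment, size):
    """Success rate within one stone-size stratum."""
    successes, patients = data[treatment][size]
    return successes / patients


def pooled_rate(treatment):
    """Success rate when stone size is ignored."""
    successes = sum(s for s, _ in data[treatment].values())
    patients = sum(n for _, n in data[treatment].values())
    return successes / patients


for size in ("small", "large"):
    print(size, round(stratum_rate("A", size), 3), round(stratum_rate("B", size), 3))
print("pooled", round(pooled_rate("A"), 3), round(pooled_rate("B"), 3))
```

A wins inside each stratum but loses in the pooled comparison, because A was given mostly to the harder large-stone cases. Random assignment would have balanced stone size across the two arms.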

This has been a huge problem in social sciences, where you can't do the controlled experiment at all, even at a smaller scale, because there is no way to randomize the choices individuals make. All you can do is try to control for the divergence statistically -- but there isn't one confounder in real data, there are thousands or more, and each one you want to control for multiplies the measurement error (because the measurement error in the primary factor combines with the measurement error in the control factor).

You're right, and in some instances it is possible to draw causal conclusions from observational data. See [0] and [1] for two pretty different perspectives. But for this to work, you need a lot of data: both lots of units (e.g. people), and a lot of information about each individual unit.

[0] Causality, Judea Pearl

[1] Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Guido Imbens and Donald Rubin

The trouble is you can't fix large numbers of statistical confounders with more data because there is a limit to how many factors you can control for before the measurement error overwhelms the signal.

To do statistical controls, you essentially sort the data by category, so that you're not just comparing black people with white people, you're comparing middle class 18 year old black female college applicants with college educated parents to middle class 18 year old white female college applicants with college educated parents.

But every one of those factors is a chance to have measured something wrong. Your group of middle class 18 year old black female college applicants with college educated parents will have a couple of people who were misidentified as middle class, a couple of people who were misidentified as black, a couple of people who were misidentified as female, a couple of people who were misidentified as 18 and a couple of people who were misidentified as having college educated parents. And they don't cancel out exactly because the original correlations with the primary factor existed to begin with, so the measurement error compounds in proportion to the strength of the correlation of the primary factor with each confounder.

Meanwhile the size of each subcategory shrinks each time you bisect it further. So the more things you try to control for, the larger the share of each subcategory that is measurement error.
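
A rough simulation of the single-confounder version of this, as a sketch under assumptions I made up: one binary confounder `z`, an exposure `x` with no real effect on the outcome `y`, and a recorded copy `w` of the confounder that is wrong 10% of the time. The probabilities and effect sizes are arbitrary.

```python
import random


def simulate(n=100_000, flip=0.10, seed=0):
    """Generate rows of (x, y, z, w); all parameters are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # true confounder
        x = rng.random() < (0.8 if z else 0.2)      # exposure depends on z
        y = 2.0 * z + rng.gauss(0.0, 1.0)           # outcome depends on z only; true effect of x is 0
        w = (not z) if rng.random() < flip else z   # recorded confounder, misclassified with prob. flip
        rows.append((x, y, z, w))
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_diff(rows):
    """Difference in mean outcome, exposed minus unexposed, with no controls."""
    return mean([r[1] for r in rows if r[0]]) - mean([r[1] for r in rows if not r[0]])


def stratified_diff(rows, key):
    """Size-weighted average of within-stratum differences; key=2 uses true z, key=3 uses recorded w."""
    estimate = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[key] == level]
        exposed = [r[1] for r in stratum if r[0]]
        unexposed = [r[1] for r in stratum if not r[0]]
        estimate += len(stratum) / len(rows) * (mean(exposed) - mean(unexposed))
    return estimate


rows = simulate()
print("no controls:          ", round(naive_diff(rows), 2))
print("control for true z:   ", round(stratified_diff(rows, 2), 2))
print("control for recorded w:", round(stratified_diff(rows, 3), 2))
```

Controlling for the true confounder removes the spurious effect. Controlling for the mismeasured one leaves roughly half of it in place under these assumptions, and that is with only one confounder and large strata.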

I hate to take “both sides” but in the absence of confounding by indication, you can often use propensity scoring within robust models to decrease these impacts.

Mind you, the problem with non-random and undetected sampling bias is that it can be subtle. See for example https://www.nytimes.com/2018/08/06/upshot/employer-wellness-...

Propensity scoring is a method of applying statistical controls. How does it address the issue of controls compounding measurement error?
Even if you had infinite data, you are not allowed to just control for everything you measured. You still need to bring in your causal knowledge. E.g. you probably shouldn't control for body weight if it was measured a month after the treatment.
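
A toy sketch of the body weight case, with made-up assumptions: treatment `t` is randomized, `m` is "lost weight by the one-month check" and is caused by the treatment, and the outcome `y` improves only through that weight loss. The probabilities and effect size are arbitrary.

```python
import random


def simulate(n=100_000, seed=1):
    """Generate rows of (t, m, y); all parameters are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = rng.random() < 0.5                    # randomized treatment
        m = rng.random() < (0.7 if t else 0.2)    # post-treatment weight loss, caused by t
        y = 2.0 * m + rng.gauss(0.0, 1.0)         # outcome improves only via weight loss
        rows.append((t, m, y))
    return rows


def mean(values):
    return sum(values) / len(values)


def total_effect(rows):
    """Treated minus untreated mean outcome, with no controls."""
    return mean([y for t, _, y in rows if t]) - mean([y for t, _, y in rows if not t])


def effect_controlling_for_m(rows):
    """Size-weighted average of treated-minus-untreated differences within each level of m."""
    estimate = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[1] == level]
        treated = [y for t, _, y in stratum if t]
        untreated = [y for t, _, y in stratum if not t]
        estimate += len(stratum) / len(rows) * (mean(treated) - mean(untreated))
    return estimate


rows = simulate()
print("total effect:             ", round(total_effect(rows), 2))
print("after 'controlling' for m:", round(effect_controlling_for_m(rows), 2))
```

Here the treatment really works, and the uncontrolled comparison shows it because treatment was randomized. Controlling for the post-treatment variable makes the effect vanish, since it blocks the path the effect travels along.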

The point is, if you use your causal knowledge in a smart way, you can also draw strong conclusions from just observational data.

Lots of practical challenges for sure!
I <3 this reply. So, so good. Sneaky way of introducing the RCT.

You’re doing it right.