by mwexler 977 days ago
I 100% agree with this blind spot. Most data science coursework avoids the very thing making it a science: the explanation of what change causes what effect. I've been surprised that year after year, programs at so many "Schools of Data Science" keep gliding over this area, perhaps alluding to it in an early stats course if at all.

It's an important part of validating that your data-driven output or decision is actually creating the change you hope for. So many fields either do poor experimentation or none at all. Others are prevented from doing the usual "full unrestricted RCT": med and fin svcs and other regulated industries have legal constraints on what they can experiment with; in other cases, data privacy restricts the measures one can take.

I've had many data folks throw up their hands if they can't do a full RCT, and instead look to pre-post with lots of methodological errors. You can guess how many of those projects end up. (No, not every change needs a full test, and some things are easy to roll back. But think of how many others would have benefitted from some uncertainty reduction.)

Sure, "LLM everything" and "just gbm it!" and "ok, just need a new feature table and I'm done!" are all important and fun parts of a data science day. But if I can't show that a data-driven decision or output makes things better, then it's just noise.

Causal modeling gets us there. It improves the impact of ML models that recognize the power of causal interventions, and it gives us evidence that we are helping (or harming).

It's (IMO) necessary, but of course, not sufficient. Lots of other great things are done by ML eng and data scientists and data eng and the rest, having nothing to do with causal inference... But I keep thinking how much better things get when we apply a causal lens to our work.

(And next on my list would be having more data folks understand slowly changing dimension tables, but this can wait for another time.)

1 comments

I realize this is nitpicking a minor point in your comment, but I don't agree with your characterization of RCTs in medical research as being primarily constrained by laws and regulations. Any time I've discussed research on human subjects with doctors doing that research, the discussion of what is and is not an acceptable experiment has always been primarily driven by the risks of harm to the people involved in the study. Any time the law comes up, it's usually because the law requires an RCT in a specific setting, as opposed to preventing it (e.g. drug trials). (Of course in the setting of starting a company based on some medical product, the situation may be quite different.)

Biologists, if not data scientists, are used to considering indirect evidence for causality. It's why we sometimes accept studies performed in other organisms as evidence for biology in humans; it's why we sometimes accept research performed on post-mortem human tissue as being representative of the biology of living humans; to name but a few examples. A big part of a compelling high-impact biology (or bioinformatics) paper is often the innovative ways that one comes up with to show causality when a direct RCT is not feasible, and papers are frequently rejected because they don't do the follow-up experiments required to show causality.

That's a very fair point. I didn't mean to suggest that harm to the patients or subjects was not the overriding factor, nor that bio, pharma, and other medical fields never do RCTs.

But there are a slew of laws and requirements around _how_ to run an RCT across the world of bio-related work, esp as it becomes a product. From marketing to manufacture to packaging, there are strict limits around where variation is allowed, at least for anything involving the FDA in the US. (Some would say too many regs, others say not enough.)

And in those cases, having a wider collection of ways to impute cause would be great.

Yes, that's true, legal requirements definitely become much more of a factor the closer you get to product and the further from research.