by tomrod
972 days ago
I've been self-taught in the causal inference space for a long time, and model evaluation is a concern for me. My biggest concern is falsification of hypotheses. In ML, you have a clear mechanism to check estimation/prediction through holdout approaches. In classical metrics, you have model metrics that can be used to define reasonable rejection regions for hypothesis tests. But causal inference doesn't seem to have this, outside of traditional model fit metrics or ML holdout assessment? So the only way a model is deemed acceptable is by prior biases? If my understanding is right, this means that each model has to be hand-crafted, adding significant technical debt to complex systems, and we can't get ahead of the assessment. And yet, it's probably the only way forward for viable AI governance.
|
To be clear, you can overfit even while your validation loss does not go up. If your train and test data are too similar, then no holdout will help you measure generalization. You have to remember that datasets are proxies for the thing you're actually trying to model; they are not the thing itself. You can usually see this when testing on data that is in-class but outside your train/test distribution (e.g. data collected by someone else).
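A toy sketch of that failure mode, under made-up assumptions: the collected dataset contains a hypothetical "shortcut" feature that tracks the label 98% of the time (think of a collection artifact), while data from someone else does not have that artifact. The function names, sample sizes, and probabilities are all invented for illustration.

```python
import random

def make_data(n, shortcut_agrees, rng):
    """y is a coin flip; x_signal is a noisy view of y (75% correct);
    x_shortcut equals y with probability `shortcut_agrees`."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x_signal = y if rng.random() < 0.75 else 1 - y
        x_shortcut = y if rng.random() < shortcut_agrees else 1 - y
        rows.append(((x_signal, x_shortcut), y))
    return rows

def accuracy(feature, rows):
    """Accuracy of the rule 'predict y = x[feature]'."""
    return sum(x[feature] == y for x, y in rows) / len(rows)

def fit_stump(rows):
    """Pick the single feature that best predicts y on the training rows."""
    return max(range(2), key=lambda f: accuracy(f, rows))

rng = random.Random(0)
collected = make_data(4000, 0.98, rng)        # our dataset, artifact included
train, holdout = collected[:3000], collected[3000:]
external = make_data(1000, 0.5, rng)          # someone else's data, no artifact

feature = fit_stump(train)                    # 0 = real signal, 1 = shortcut
train_acc = accuracy(feature, train)
holdout_acc = accuracy(feature, holdout)
external_acc = accuracy(feature, external)

print(f"chosen feature: {feature}")
print(f"train {train_acc:.3f}  holdout {holdout_acc:.3f}  external {external_acc:.3f}")
```

The holdout score looks as good as the training score because both come from the same collection process; only the external data shows that the model latched onto the artifact rather than the signal.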
You have to be careful because there are a lot of small and non-obvious things that can fuck up statistics. There are a lot of aggregation "paradoxes" (Simpson's, Berkson's) and all kinds of other things that can creep in. This gets more perilous the bigger your model is, too. The story of the Monty Hall problem is a great example of how easy it is to get the wrong answer while it seems like you're doing all the right steps.
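For a concrete look at Simpson's paradox, here is a small sketch using the counts usually quoted for the textbook kidney stone example (treatment A vs. B, split by stone size). Treat the numbers as illustrative; the variable and function names are my own.

```python
from fractions import Fraction

# (successes, patients) per treatment and stone size
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def pooled_rate(strata):
    """Success rate after lumping all strata together."""
    successes = sum(s for s, _ in strata.values())
    patients = sum(n for _, n in strata.values())
    return Fraction(successes, patients)

# Success rate within each stratum, and after aggregating over strata
per_stratum = {
    t: {size: Fraction(s, n) for size, (s, n) in strata.items()}
    for t, strata in counts.items()
}
pooled = {t: pooled_rate(strata) for t, strata in counts.items()}

for size in ("small", "large"):
    a, b = per_stratum["A"][size], per_stratum["B"][size]
    print(f"{size:>6}: A={float(a):.3f}  B={float(b):.3f}")
print(f"pooled: A={float(pooled['A']):.3f}  B={float(pooled['B']):.3f}")
```

A wins inside every stratum but loses after aggregation, because A was given far more of the hard (large stone) cases. Which comparison is the "right" one depends on the causal story, not on the arithmetic.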
As for the article, the author is far too handwavy with causal inference. The reason we tend not to do it is that it is fucking hard and it scales poorly. Models like autoregressive models (careful here) and normalizing flows can do causal inference (and causal discovery) fwiw (essentially you need explicit density models with tractable densities, referring to Goodfellow's taxonomy). But things get funky as you get a lot of variables, because there are indistinguishable causal graphs (see Hyvärinen and Pajunen). Then there are also the issues with the types of causality (see Judea Pearl's Ladder of Causation), and counterfactual inference is FUCKING HARD, but the author just acts like it's no big deal. Then he starts conflating it with weaker forms of causal inference. Correlation is the weakest form of causation (association is the bottom rung of that ladder), despite our often chanted saying that "correlation does not equate to causation" (which is still true; correlation is just in the same class, and the saying is mostly getting at confounding variables). This very much does not scale. Similarly, discovery won't scale, as you have to permute so many variables in the graph. The curse of dimensionality hits causal analysis HARD.
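To put a rough number on why discovery doesn't scale: the count of possible DAGs over n labeled variables grows super-exponentially. A minimal sketch using Robinson's recurrence (the function name `count_dags` is my own; this only counts candidate graphs and is not a discovery algorithm):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of directed acyclic graphs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edges
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 11):
    print(f"{n:>2} variables: {count_dags(n):,} candidate DAGs")
```

Three variables already give 25 candidate graphs and five give 29,281; by ten variables the count is past 10^18. And that is before accounting for graphs that observational data alone cannot tell apart.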