by p4wnc6 3605 days ago
One special case of this general effect is non-linear coding error, especially in cases where what matters for causal inference is a transformation of a covariate (e.g. its log) rather than its recorded level, or where the covariates are categorical or isotonic.

The paper "Let's Put Garbage Can Regressions and Garbage Can Probits Where They Belong" by Achen [0] is a great discussion about some particular properties of this, and the tacit assumptions used to ignore it.

In that paper, it's demonstrated that with just a tiny bit of coding error in the covariates, you can end up with a fitted regression coefficient that is statistically significant and has the wrong sign -- even when there is no noise whatsoever in the target variable. That is, you can set up a toy example in which the target variable is synthetically generated as an exact linear function of two covariates with positive coefficients, then apply a slight non-linear distortion to one of the covariates, regress the synthetic target on the clean covariate and the distorted covariate, and get wildly incorrect coefficients that appear to be statistically significant.
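
To make the mechanism concrete, here is a minimal sketch in Python with NumPy. It is my own hypothetical toy data, not the table from Achen's paper, and the names (z, w, x2, jitter, ols) are made up for illustration. The assumptions: the target is an exact, noise-free linear function of a true covariate z and a clean covariate x2, both with positive coefficients; z is only recorded as its square root w (my choice of distortion); and x2 is correlated with z in a curved way.

```python
import numpy as np

# Hypothetical toy data: 5 levels of the true covariate, 4 observations each.
z = np.repeat([0.0, 1.0, 4.0, 9.0, 16.0], 4)   # true covariate driving y
w = np.sqrt(z)                                  # what gets recorded: a monotone recoding of z
jitter = np.tile([-0.5, 0.5, -0.5, 0.5], 5)     # deterministic spread so x2 is not a pure function of z
x2 = 4.0 * np.sqrt(z) - 0.5 * z + jitter        # clean second covariate, correlated with z
y = 1.0 * z + 0.5 * x2                          # noise-free target, both true coefficients positive


def ols(columns, target):
    """Ordinary least squares with an intercept; returns (coefficients, standard errors)."""
    X = np.column_stack([np.ones(len(target))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    dof = len(target) - X.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se


# Regressing on the true covariates recovers the generating coefficients.
beta_true, _ = ols([z, x2], y)

# Regressing on the recoded covariate w and the clean covariate x2 does not.
beta_bad, se_bad = ols([w, x2], y)
t_bad = beta_bad / se_bad

print("true covariates   [const, z, x2]:", np.round(beta_true, 3))
print("recoded covariate [const, w, x2]:", np.round(beta_bad, 3))
print("t-statistics      [const, w, x2]:", np.round(t_bad, 2))
```

With these particular numbers the coefficient on x2 comes out at roughly -0.97 with a t-statistic of roughly -4.6, even though its true coefficient is +0.5 and y contains no noise at all. The curvature that w fails to capture gets loaded onto x2, because x2 happens to be curved in the opposite direction relative to w.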

People seem to think these toy examples are some kind of alien phenomenon that could never happen with real-world data, but the paper is very explicit in the construction of the example data set. It's not harebrained or contrived, like Anscombe's Quartet or anything -- it's very much a plausible data set.

I think it's not hyperbolic at all to say that results like this more or less conclusively show that naive linear regression cannot be trusted. If you're careful with model validation, using randomized holdout data, lots of diagnostic plotting and sanity checking, then regression is a fine tool. But if you do something shocking like take two different univariate models with the same target, fit their regression coefficients, and then select the model with the more favorable t-stat as "the winner", then you are committing an egregious statistical fallacy that often, in real-world situations, gives you not just an inaccurate answer, but an answer pointing in the opposite direction from the truth.

What's frightening to me is that across many industries, even in places like high finance -- where "real money is on the line" -- it is extremely common to see huge business intelligence systems predicated entirely on this type of fallacious statistical approach with regression. Sadly, it's often because the regression approach was historically more tractable and the fallacies weren't as well known. And so as certain people gained more senior positions and sought to retain political control of the business tools that they oversaw, they grasped for convenient fictions like "interpretability" to justify their political choice to shun modern techniques.

[0] http://www.columbia.edu/~gjw10/achen04.pdf