| HN Mirror


	by SubiculumCode 2989 days ago
	When predictors are correlated with each other, you get multicollinearity, which can lead to incorrect statistical inferences.

2 comments

MichailP 2989 days ago

Thanks for the answer. And what is the correct approach here, if you can only choose or not choose a predictor for the final set? Discard all multicollinear predictors or pick just one of them?

SubiculumCode 2989 days ago

Sticking to just linear regression: if those variables are measuring the same construct, pick the best one or use a method to combine their scores. If they measure different constructs but are very correlated, then you'd need to drop one, depending on the variance inflation factor, which you can test for.
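
For anyone wanting to check the variance inflation factor by hand, here is a minimal sketch, assuming NumPy and made-up data. The helper name `vif` and the three toy predictors are hypothetical. It uses the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]                                # predictor j plays the response
        others = np.delete(X, j, axis=1)           # all remaining predictors
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Hypothetical data: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)  # large values for a and b, close to 1 for c
```

What counts as "too high" is a rule of thumb rather than a test with a fixed threshold, so the cutoff is left to the reader.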

As the article mentions however, there are regression methods meant for these situations (e.g. ridge regression).
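
As a rough illustration of ridge regression, here is a sketch using the closed-form solution (X'X + lambda*I)^(-1) X'y on centered, made-up data. The function name `ridge`, the penalty value 10.0 and the simulated predictors are assumptions for the example, not anything taken from the article.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate; lam = 0 reduces to ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical data: two highly correlated predictors.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

# Center everything so no intercept is needed (and none gets penalized).
X = X - X.mean(axis=0)
y = y - y.mean()

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 10.0)
print(beta_ols, beta_ridge)
```

The penalty shrinks the coefficient vector toward zero, which stabilizes the estimates at the cost of some bias.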

SubiculumCode 2989 days ago

One thing that should be mentioned, though: in the case of polynomials, e.g. y ~ x + x^2, there will be a lot of multicollinearity between these terms, but that multicollinearity is OK. Just be sure to center your variables.
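
A quick sketch of the centering point, assuming NumPy and a hypothetical predictor spread evenly over [10, 20]: the raw x and x^2 are almost perfectly correlated, while the centered versions are not.

```python
import numpy as np

# Hypothetical predictor on an evenly spaced grid, far from zero.
x = np.linspace(10.0, 20.0, 101)

# Correlation between the raw linear and quadratic terms.
corr_raw = np.corrcoef(x, x**2)[0, 1]

# Center first, then square.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc**2)[0, 1]

print(corr_raw, corr_centered)
```

The centered correlation vanishes here only because this x is symmetric around its mean. With skewed data, centering reduces the correlation but does not remove it.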

thanatropism 2989 days ago

Wrong.

Wrong, wrong, wrong, wrong.

If predictors are linearly dependent you don't get to do regression at all -- your X'X is singular. But then, the extra regressors add no information at all, and classical statistical packages (SPSS, Stata, etc.) drop them automatically.

Even if predictors are highly correlated, the OLS estimator is unbiased. This is the stuff of elementary statistics. You just get higher and higher p-values/wider and wider CIs, especially if your samples are econometrics-sized.
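
A small Monte Carlo sketch of that claim, assuming NumPy. The function name `simulate_b1`, the sample size, the number of replications and the true coefficients (both 1.0) are all made up for illustration.

```python
import numpy as np

def simulate_b1(rho, n=100, reps=2000, seed=0):
    """Mean and standard deviation of the OLS estimate of the x1 coefficient."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(reps)
    for r in range(reps):
        z1 = rng.normal(size=n)
        z2 = rng.normal(size=n)
        x1 = z1
        x2 = rho * z1 + np.sqrt(1.0 - rho**2) * z2   # corr(x1, x2) is about rho
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)  # true coefficients are 1.0
        X = np.column_stack([np.ones(n), x1, x2])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates[r] = coef[1]
    return estimates.mean(), estimates.std()

mean_low, sd_low = simulate_b1(rho=0.0)
mean_high, sd_high = simulate_b1(rho=0.99)
print(mean_low, sd_low)
print(mean_high, sd_high)
```

In both settings the estimates should average out near the true value of 1.0, while their spread is several times larger when the predictors are highly correlated.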

---

You people need to watch some Khan Academy or whatever the cool kids are doing now to learn maths.

SubiculumCode 2988 days ago

There is no need to be rude or yell.

Yes, if your variables are perfectly linearly dependent they get dropped. Did anyone say otherwise? I did not think about this case because most correlated measures causing multicollinearity problems aren't perfectly 'linearly dependent'. Linear dependency usually only comes up in practice if you miscoded some of your independent dummy variables (e.g. adding both 'male[0,1]' and 'not male[0,1]' as two categorical predictors). So I am not really sure of your point.
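
The dummy-coding case can be sketched as follows, assuming NumPy, six made-up observations and a model that also includes an intercept. The variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample of six people.
male = np.array([0, 1, 1, 0, 1, 0], dtype=float)
not_male = 1.0 - male
intercept = np.ones_like(male)

# intercept = male + not_male, so these three columns are linearly dependent.
X_trap = np.column_stack([intercept, male, not_male])
# Dropping one of the dummies restores full column rank.
X_ok = np.column_stack([intercept, male])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(rank_trap, X_trap.shape[1])  # rank is below the number of columns
print(rank_ok, X_ok.shape[1])      # rank equals the number of columns
```

When the rank falls short of the number of columns, X'X is singular and the coefficients are not uniquely determined, which is why packages drop one of the columns.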

As to your second point, it might be unbiased, but the statistical inference (i.e. p-value) would be incorrect with multicollinearity. Thus, again, I am not sure of your point when you are only repeating what I said.

Moreover, it may not be particularly meaningful to the researcher even if the parameter estimate is unbiased. One frequently finds with multicollinearity that the signs of effects will switch (- to +, or + to -) as you add highly correlated predictors into a model, often in theoretically questionable ways. This does serve to remind one that the parameter estimates are only meaningful in the context of the other predictors in the model.

syntaxfree 2988 days ago

There's this other thing called the FWL (Frisch-Waugh-Lovell) theorem.

As long as the unexplained term is uncorrelated (in the probabilistic model; linear regression will force this to be the case computationally) with the included variables, your coefficients will remain unchanged. So adding/removing variables shouldn't change results at all -- unless the model is mis-specified and you're including variables that correlate with unobserved factors in unexpected ways.
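
For reference, a minimal numerical sketch of both points, assuming NumPy and simulated data. The helper names `ols` and `residualize` and all coefficients are hypothetical. The first check is the FWL theorem itself: the coefficient on x1 in the full regression equals the one obtained after partialling x2 out of both y and x1. The second check adds a regressor that is exactly uncorrelated with x1 in the sample and confirms that the x1 coefficient does not move.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def residualize(v, Z):
    """Residuals of v after regressing it on the columns of Z."""
    return v - Z @ ols(Z, v)

# Hypothetical data with correlated predictors.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

# Full regression: coefficients are [intercept, b1, b2].
full = ols(np.column_stack([ones, x1, x2]), y)

# FWL: partial the constant and x2 out of y and x1, then regress.
Z = np.column_stack([ones, x2])
y_res = residualize(y, Z)
x1_res = residualize(x1, Z)
b1_fwl = (x1_res @ y_res) / (x1_res @ x1_res)

# A regressor made orthogonal to the constant and x1 in this sample.
x2_orth = residualize(x2, np.column_stack([ones, x1]))
short = ols(np.column_stack([ones, x1]), y)
long_orth = ols(np.column_stack([ones, x1, x2_orth]), y)

print(full[1], b1_fwl)        # agree up to floating-point error
print(short[1], long_orth[1]) # agree up to floating-point error
```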

So, for example, a regression of children's IQ on the income of their parents provides a plausible mechanism; but if you add the arm length of the kids you will have problems, since arm length is correlated with an omitted variable (kids with longer arms are older and perform better on IQ tests).

That's most of the "in context" story. Nothing to do with multicollinearity.

SubiculumCode 2988 days ago

Thanks for the thoughtful comment and reference.

The 'in context' was not so much about multicollinearity but about shared and unique variance.
