by MichailP 2989 days ago
Now this is a topic I desperately need. Can anyone here by any chance explain why one would choose predictors in multilinear regression that are NOT correlated with the target? I am having trouble understanding paper [1], where the authors avoid using predictors that are correlated with the target. The target is the ozone concentration shown by a reference instrument, and the predictors are low-cost sensor outputs.

[1] https://www.sciencedirect.com/science/article/pii/S092540051... Section 4.1 about ozone predictors

2 comments

The issue is inter-predictor correlation. In the extreme case that a predictor is duplicated, the correct beta might be {beta*a, beta*(1-a)} for a in [0, 1], which an algorithm may not estimate in a stable manner. A significant degree of correlation introduces this general problem.
... or worse; it is still true for any a, not just a in [0, 1]. You could easily get {1,000,001, -1,000,000}, which for perfectly clean, precise, representable data is equivalent, but which magnifies any noise/error in one of the predictors by a million, or a billion.
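A minimal sketch of that magnification in Python with NumPy. The data, the noise scale of 1e-3, and the helper name `predict` are illustrative assumptions, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 1.0 * x  # true relationship: y = 1 * x, with the predictor duplicated below

def predict(b1, b2, x1, x2):
    """Linear prediction from two predictors with coefficients b1 and b2."""
    return b1 * x1 + b2 * x2

# On perfectly clean duplicates, both coefficient pairs reproduce y.
balanced = predict(0.5, 0.5, x, x)
extreme = predict(1_000_001.0, -1_000_000.0, x, x)

# Now let the second copy carry a small measurement error.
noise = rng.normal(scale=1e-3, size=x.size)
x_noisy = x + noise
err_balanced = np.abs(predict(0.5, 0.5, x, x_noisy) - y).max()
err_extreme = np.abs(predict(1_000_001.0, -1_000_000.0, x, x_noisy) - y).max()
print(err_balanced, err_extreme)
```

Both coefficient pairs sum to 1, so they agree on clean data, but the error in the second copy is scaled by 0.5 in one case and by a million in the other.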
So say you have 3 predictors that have high inter-predictor correlation. Can you still pick one of them and discard the remaining 2? Or can't you pick any one of them?
Ridge regression (mentioned in TFA) would prefer a (1/3,1/3,1/3) average of those predictors (or a better combination, depending on their respective noises).

Lasso (also mentioned in TFA) would prefer to pick the best of the three and drop the others.

Elastic net would be a combination of both.

Note, though, that any method other than simple regression has tuning parameters -- depending on those, you could still end up with a result equivalent to plain least squares.
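A rough illustration of the ridge case only, using the closed-form solution in NumPy rather than any particular library. The function name `ridge`, the penalty `lam=1e-3`, and the idealized setup of three exactly identical predictors are assumptions made for the sketch:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (no intercept): (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x, x])   # three identical predictors
y = 3.0 * x                      # true total effect is 3

coef = ridge(X, y, lam=1e-3)
print(coef)
```

With lam set to exactly 0 the matrix X'X is singular and the solve fails or becomes unstable; a small positive penalty makes the problem well-posed and splits the total effect evenly across the three copies. Larger penalties shrink all three coefficients toward zero, which is the tuning-parameter dependence mentioned above.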

You can, but why trash information that is present when you can leverage it with a different approach?
Like PCA? But that way you lose the physical meaning of the predictors.
PCA is a special case of factor analysis, so you are representing them as observations of a latent variable (which is often a narrative people use when explaining why two x variables are correlated)
When predictors are correlated with each other you get multicollinearity potentially leading to incorrect statistical inferences.
Thanks for the answer. And what is the correct approach here, if you can only choose/not choose a predictor for the final set? Discard all multicollinear predictors or pick just one of them?
Sticking to just linear regression: if those variables are measuring the same construct, pick the best one or use a method to combine their scores. If they measure different constructs but are very correlated, then you'd need to drop one, depending on the variance inflation factor, which you can test for.
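For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A minimal NumPy sketch, with the function name `vif` and the toy data being assumptions for illustration:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        out[j] = ss_tot / ss_res   # equals 1 / (1 - R^2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)   # nearly a copy of a
c = rng.normal(size=500)             # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

Here the two near-duplicate columns get large values and the unrelated column stays close to 1. Cutoffs such as 5 or 10 are common rules of thumb rather than hard thresholds.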

As the article mentions however, there are regression methods meant for these situations (e.g. ridge regression).

One thing that should be mentioned, though: in the case of polynomials, e.g. y ~ x + x^2, there will be a lot of multicollinearity between these terms, but that multicollinearity is OK. Just be sure to center your variables.
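A quick check of what centering does to the correlation between x and x^2, assuming a strictly positive predictor drawn uniformly from [10, 20] (an arbitrary choice for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 20, size=1000)   # strictly positive predictor

# Correlation between the linear and quadratic terms, before centering.
corr_raw = np.corrcoef(x, x**2)[0, 1]

# Same correlation after subtracting the mean from x first.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc**2)[0, 1]
print(corr_raw, corr_centered)
```

For a roughly symmetric predictor like this one, the correlation drops from nearly 1 to nearly 0. A skewed predictor would keep some correlation between the centered terms.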
Wrong.

Wrong, wrong, wrong, wrong.

If predictors are linearly dependent you don't get to do regression at all -- your X'X is singular. But then, the extra regressors add no information at all, and classical statistical packages (SPSS, Stata, etc.) drop them automatically.

Even if predictors are highly correlated, the OLS estimator is unbiased. This is the stuff of elementary statistics. You just get higher and higher p-values/wider and wider CIs, especially if your samples are econometrics-sized.
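A small Monte Carlo sketch of that claim in NumPy. The sample size of 50, the 2000 replications, the true slopes of 1, and the function name `simulate` are all assumptions chosen for illustration:

```python
import numpy as np

def simulate(rho, n=50, reps=2000, seed=0):
    """Return OLS estimates of the first slope over many simulated samples."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])   # predictor correlation is rho
    beta = np.array([1.0, 1.0])                # true slopes
    estimates = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = X @ beta + rng.normal(size=n)
        estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0][0]
    return estimates

low = simulate(rho=0.0)
high = simulate(rho=0.99)
print(low.mean(), low.std())
print(high.mean(), high.std())
```

In both settings the estimates average out near the true slope of 1, but their spread is several times larger when the predictors are correlated at 0.99, which is what shows up as wider confidence intervals.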

---

You people need to watch some Khan Academy or whatever the cool kids are doing now to learn maths.

There is no need to be rude or yell.

Yes, if your variables are perfectly linearly dependent they get dropped. Did anyone say otherwise? I did not think about this case because most correlated measures causing multicollinearity problems aren't perfectly 'linearly dependent'. Linear dependency usually only comes up in practice if you miscoded some of your independent dummy variables (e.g. adding both 'male[0,1]' and 'not male[0,1]' as two categorical predictors). So I am not really sure of your point.

As to your second point, it might be unbiased, but the statistical inference (i.e. p-value) would be incorrect with multicollinearity. So again, I am not sure of your point when you are only repeating what I said.

Moreover, it may not be particularly meaningful to the researcher even if the parameter estimate is unbiased. One frequently finds with multicollinearity that the signs of effects will switch (- to +, or + to -) as you add highly correlated predictors into a model, often in theoretically questionable ways. That does serve to remind one that the parameter estimates are only meaningful in the context of the other predictors in the model.

There's this other thing called the FWL theorem.

As long as the unexplained term is uncorrelated (in the probabilistic model; linear regression will force this to be the case computationally) with the included variables, your coefficients will remain unchanged. So adding/removing variables shouldn't change results at all -- unless the model is mis-specified and you're including variables that correlate with unobserved factors in unexpected ways.

So for example a regression of children's IQ on the income of their parents provides a plausible mechanism; but if you add the arm length of the kids you will have problems, since arm length is correlated to an omitted variable (kids with longer arms are older and perform better on IQ tests).

That's most of the "in context" story. Nothing to do with multicollinearity.
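For anyone who wants to check the FWL theorem numerically, here is a minimal NumPy sketch. The simulated data, the true coefficients, and the helper names `ols` and `residualize` are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)      # x1 is correlated with x2
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for y regressed on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residualize(v, Z):
    """Residuals of v after regressing it on the columns of Z."""
    return v - Z @ ols(Z, v)

ones = np.ones(n)
full = ols(np.column_stack([ones, x1, x2]), y)   # [intercept, b1, b2]

# FWL: partial x2 (and the intercept) out of both y and x1, then regress.
Z = np.column_stack([ones, x2])
b1_fwl = ols(residualize(x1, Z)[:, None], residualize(y, Z))[0]

short = ols(np.column_stack([ones, x1]), y)      # x2 omitted
print(full[1], b1_fwl, short[1])
```

The coefficient on x1 from the full regression matches the one from the residualized regression up to floating-point error. The short regression gives a different value here because the omitted x2 both affects y and correlates with x1, which is the mis-specified case described above.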

Thanks for the thoughtful comment and reference.

The 'in context' was not so much about multicollinearity but about shared and unique variance.