|
|
|
|
|
by habaryu
3378 days ago
|
|
What does this paragraph even mean? "On the same note, though perfectly correlated variables are redundant and might not add value to the model (and if removed, would be computationally efficient), a high-variable correlation could have additional information to add. In other words, two variables that are not correlated could still be of importance to the model. When in doubt, it’s safer to train the model and observe it’s performance." The "In other words" part seem to talk about uncorrelated variables while the sentence before that talked about highly correlated variables. I always thought to discard one of two variables that were highly correlated. |
|
The bit about "it's safer to train the model and observe its performance", though, is reasonable advice. The main problem with multicollinearity is that it gives you a model with poorly-defined coefficients - that is, their standard errors are high. That's a big problem if you're coming at the problem with a statistician's mindset and trying to come up with a parsimonious model with statistically significant parameter estimates. If you're just going for the best predictive model you can get, though, then you don't necessarily care about super tight standard errors on all your coefficients, so it's not such a big deal.