by mystique 4124 days ago
Good list. I am new to Machine Learning with only ~1 year of real work and sometimes I slip and make one of these mistakes.

I have a question on #7. I have not used the coefficients as a measure of feature importance, but I sometimes get tempted to. How do you explain to non-stats people which factors matter most for some outcome?

4 comments

I have no professional experience with ML, so I might be missing something obvious here that's part of the industry paradigm.

But the article gives two points why you shouldn't use coefficient values to determine feature importance, which I think are only valid to some extent.

>a) changing the scale of the variable changes the absolute value of the coefficient

and

>(b) if features are multi-collinear, coefficients can shift from one feature to others.

Regarding a), well, that's what standardized coefficients are for.
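To make a) concrete, here is a minimal sketch (NumPy only, with made-up synthetic data; the helper names are my own, not from the article). It fits plain least squares before and after rescaling one feature, then does the same with every variable z-scored first. The raw coefficient moves with the units, the standardized one should not.

```python
import numpy as np

def ols_coefficients(X, y):
    """Least-squares slopes for y ~ intercept + X (intercept is dropped from the result)."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1:]

def standardized_coefficients(X, y):
    """Slopes after z-scoring every feature and the target."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    return ols_coefficients(Xz, yz)

# Synthetic data (an assumption for illustration): feature 0 matters more than feature 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Same data, but feature 0 is expressed in units 1000 times smaller (e.g. km -> m).
X_rescaled = X * np.array([1000.0, 1.0])

raw = ols_coefficients(X, y)
raw_rescaled = ols_coefficients(X_rescaled, y)      # coefficient 0 shrinks by 1000x
std = standardized_coefficients(X, y)
std_rescaled = standardized_coefficients(X_rescaled, y)  # unchanged by the rescaling

print("raw:         ", raw, "->", raw_rescaled)
print("standardized:", std, "->", std_rescaled)
```

Note that this only addresses the scale objection; it does nothing about collinearity.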

b) is a bit trickier, but most regression models assume that the features are not strongly collinear. This is of course a problem with real-world data, where you will quite often find some level of collinearity. That's when you (1) test for this issue and (2) look towards multilevel models.
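For step (1), one common diagnostic is the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R^2). This is my choice of test, not something the article prescribes, and the data below is synthetic. A rough sketch with NumPy only:

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column: 1 / (1 - R^2) from regressing that column on the others."""
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        # With an intercept in the model, R^2 = 1 - var(residuals) / var(target).
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic example (an assumption for illustration).
rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # independent of both

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(vifs)  # large values for a and b, close to 1 for c
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but that threshold is a convention, not a hard rule.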

Point #7 is just referring to the magnitudes (or absolute values) of the coefficients. You can still determine which features are relatively important using the coefficient p-values, if those are available. This of course depends on the assumptions of the regression method you are using being satisfied, as otherwise the p-values will be biased.

In terms of explaining this to non-stats people, you might want to avoid explaining the p-values directly to them (as it's very easy for people to get confused about what p-values actually mean), so instead you might simply show them which features are "statistically significant". In other words, try to explain the results in a qualitative way rather than a strictly quantitative one.

All the other comments are great. Just bear in mind that it's important to really understand the mechanics behind each importance measurement. Some use information gain, some use the t-test on the coefficients, while others fit a random forest and see whether removing a feature makes a big impact, etc. They all make different assumptions, and the key point is, again, to understand whether those assumptions apply to your situation.
Some techniques, e.g. random forests, give variable importance indicators for free. If you can test it out, give it a go. You don't have to use the random forest as the final model.
You can use correlation or the coefficients of a linear model iff the features are on the same scale. Another method is to train a model leaving out each feature once, then see how much accuracy drops.
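A minimal sketch of the leave-one-feature-out idea, assuming a plain least-squares model and held-out R^2 standing in for "accuracy" (both are my assumptions, as are the helper names and the synthetic data; any model and metric would work the same way):

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept on the training set, predict on the test set."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ beta

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def drop_column_importance(X_train, y_train, X_test, y_test):
    """Drop in held-out R^2 when each feature is removed and the model is refit."""
    baseline = r_squared(y_test, fit_predict(X_train, y_train, X_test))
    drops = []
    for j in range(X_train.shape[1]):
        reduced_train = np.delete(X_train, j, axis=1)
        reduced_test = np.delete(X_test, j, axis=1)
        score = r_squared(y_test, fit_predict(reduced_train, y_train, reduced_test))
        drops.append(baseline - score)
    return np.array(drops)

# Synthetic data: feature 0 strong, feature 1 weak, feature 2 pure noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600)

drops = drop_column_importance(X[:400], y[:400], X[400:], y[400:])
print(drops)  # largest drop for feature 0, near zero for feature 2
```

One caveat: with collinear features, dropping one of them may barely hurt because the others cover for it, so a small drop does not always mean the feature is irrelevant.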