by civilized 1681 days ago
"You have to oversample your minority class so your data is balanced" is an urban legend that needs to die. It is never necessary, unless you are using extremely outdated model fitting methods, and even then, it would only be needed in training. It is completely unnecessary in evaluation, and metrics on biased data are going to be biased (obviously!).

If I hear this in an interview, I'm going to assume you do data science by blindly copying random blog posts.

2 comments

While I agree with your overall sentiment, it is important to understand the relationship between class (im)balance and ROC curve performance. A very short article which does a great job explaining this is [0]. There are, of course, other performance metrics that are appropriate in the presence of class imbalance, such as precision-recall curves, so I wouldn't go so far as to say "metrics on biased data are going to be biased". Some metrics can correct for the class imbalance bias. Others can't.

[0] https://www.researchgate.net/profile/Jake-Lever/publication/...
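
A rough sketch of the relationship, using made-up scores that are not taken from the linked article: the true positive rate and the false positive rate are each computed within a single class, so replicating the negatives leaves a ROC point where it is, while precision mixes both classes and drops as the negatives grow. The function name `rates`, the threshold, and the score values are all illustrative assumptions.

```python
def rates(pos_scores, neg_scores, threshold):
    """Return (TPR, FPR, precision), predicting positive when score >= threshold."""
    tp = sum(s >= threshold for s in pos_scores)  # positives correctly flagged
    fp = sum(s >= threshold for s in neg_scores)  # negatives wrongly flagged
    tpr = tp / len(pos_scores)
    fpr = fp / len(neg_scores)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    return tpr, fpr, precision

# Hypothetical classifier scores, chosen only for illustration.
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]

balanced = rates(pos, neg, threshold=0.5)         # 4 positives vs 4 negatives
imbalanced = rates(pos, neg * 10, threshold=0.5)  # 4 positives vs 40 negatives

print("balanced   (TPR, FPR, precision):", balanced)
print("imbalanced (TPR, FPR, precision):", imbalanced)
```

TPR and FPR come out identical in both runs; only precision moves with the class ratio.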

What do you mean by class imbalance bias?
Agreed. ROC AUC is fairly robust to over- or undersampling a class when determining whether your classifier is predictive.
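
A small sketch of that claim, assuming "oversampling" means duplicating every example of a class the same number of times. ROC AUC equals the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as half, so duplication scales the numerator and the denominator by the same factor. The helper `pairwise_auc` and the scores are made up for illustration; random resampling would give approximately, not exactly, the same value.

```python
def pairwise_auc(pos_scores, neg_scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical scores; the positive class is the minority here.
pos = [0.9, 0.6, 0.35]
neg = [0.8, 0.5, 0.4, 0.3, 0.2, 0.1]

original = pairwise_auc(pos, neg)
oversampled = pairwise_auc(pos * 2, neg)  # every positive duplicated once

print("original AUC:   ", original)
print("oversampled AUC:", oversampled)
```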
See my sibling comment. Suppose you have data with 0.95 positive class and 0.05 negative class. You can achieve high ROC AUC with the classifier that blindly predicts the positive class. It may be "predictive" (after all 0.95 of the data is positive), but I would hesitate to praise such a classifier.