| HN Mirror



	by civilized 1728 days ago
	"You have to oversample your minority class so your data is balanced" is an urban legend that needs to die. It is never necessary, unless you are using extremely outdated model fitting methods, and even then, it would only be needed in training. It is completely unnecessary in evaluation, and metrics on biased data are going to be biased (obviously!). If I hear this in an interview, I'm going to assume you do data science by blindly copying random blog posts.

2 comments

aabaker99 1728 days ago

While I agree with your overall sentiment, it is important to understand the relationship between class (im)balance and ROC curve performance. A very short article which does a great job explaining this is [0]. There are, of course, other performance metrics that are appropriate in the presence of class imbalance, such as precision-recall curves, so I wouldn't go so far as to say "metrics on biased data are going to be biased". Some metrics can correct for the class imbalance bias. Others can't.

[0] https://www.researchgate.net/profile/Jake-Lever/publication/...
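
To make the point about ROC versus precision-recall concrete, here is a minimal sketch. It assumes a hypothetical classifier whose per-class behaviour is fixed (the 0.90 true positive rate and 0.10 false positive rate are made-up numbers, not taken from the linked article) and only varies the class mix it is evaluated on. The ROC coordinates (TPR, FPR) stay put, while precision moves with prevalence.

```python
# Hypothetical classifier with fixed per-class behaviour (assumed numbers).
TPR = 0.90  # fraction of actual positives that get flagged
FPR = 0.10  # fraction of actual negatives that get flagged


def rates(n_pos, n_neg, tpr=TPR, fpr=FPR):
    """Expected TPR, FPR and precision for a given class mix."""
    tp = tpr * n_pos          # true positives
    fn = n_pos - tp           # false negatives
    fp = fpr * n_neg          # false positives
    tn = n_neg - fp           # true negatives
    return {
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


balanced = rates(n_pos=500, n_neg=500)    # 50% positive
imbalanced = rates(n_pos=50, n_neg=950)   # 5% positive

for name, r in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

With these assumed rates, precision is 0.9 on the balanced mix and 45/140 (about 0.32) on the 5% positive mix, even though the classifier itself has not changed.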


civilized 1728 days ago

What do you mean by class imbalance bias?


cweill 1728 days ago

Agreed. ROC AUC is fairly robust to over- or undersampling a class when determining whether your classifier is predictive.
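
A rough way to see this robustness: ROC AUC can be computed as the fraction of (positive, negative) pairs that the scores rank correctly, so repeating every minority example the same number of times scales the numerator and denominator equally. The sketch below uses synthetic Gaussian scores and a hand-rolled `roc_auc` helper (both are illustrative assumptions, not anyone's actual pipeline), and it only covers evaluation with fixed scores, not retraining on resampled data.

```python
import random


def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic, imbalanced evaluation set: 20 positives, 380 negatives.
n_pos, n_neg = 20, 380
labels = [1] * n_pos + [0] * n_neg
scores = [rng.gauss(1.0, 1.0) if y == 1 else rng.gauss(0.0, 1.0) for y in labels]

# Oversample the minority class by repeating every positive k times in total.
k = 19
over_labels = labels + [1] * (n_pos * (k - 1))
over_scores = scores + scores[:n_pos] * (k - 1)

auc_original = roc_auc(labels, scores)
auc_oversampled = roc_auc(over_labels, over_scores)
print(auc_original, auc_oversampled)
```

Random oversampling with replacement would not duplicate every example equally, so there the two values would be expected to be close rather than identical.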


aabaker99 1728 days ago

See my sibling comment. Suppose you have data with 0.95 positive class and 0.05 negative class. You can achieve high ROC AUC with the classifier that blindly predicts the positive class. It may be "predictive" (after all 0.95 of the data is positive), but I would hesitate to praise such a classifier.
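
The 0.95/0.05 scenario is easy to check directly. The sketch below scores an always-positive predictor on a hypothetical 100-example set, using a hand-rolled pairwise `roc_auc` helper (an illustrative assumption, with ties counted as half). Under that definition a constant score ties on every pair, so the number that tracks the 0.95 base rate in this setup is accuracy, while ROC AUC comes out at 0.5.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 95% positive, 5% negative.
labels = [1] * 95 + [0] * 5

# A "blind" classifier: the same positive score for every example.
scores = [1.0] * len(labels)
predictions = [1 if s >= 0.5 else 0 for s in scores]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
auc = roc_auc(labels, scores)
print(accuracy, auc)
```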
