by gcmac 3099 days ago
I disagree with the analysis in this article. In a typical machine learning process, the response variable stays the same (at a distributional level) while you cycle through candidate models. So whatever the class distribution is, a higher AUC score indicates a better model.

It might be true that classifier performance is worse on an imbalanced data set than on a balanced one (at the same AUC score), but that just reflects the fact that classifiers are harder to build for imbalanced data.

2 comments

Having a better AUC score does not guarantee a better AUPR score, so a model with a better AUC is not universally "better".

See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98....
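
To make the AUC vs. AUPR point concrete, here is a small sketch in plain Python. The two rankings are made-up toy data (not from the article or the linked paper), and `roc_auc` and `average_precision` are illustrative helpers that assume no tied scores. Average precision is used here as a common stand-in for the area under the precision/recall curve.

```python
def roc_auc(ranked_labels):
    """ROC AUC for labels sorted by descending score (no ties assumed).

    Equals the fraction of (positive, negative) pairs in which the
    positive is ranked above the negative.
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    negs_seen = 0
    wins = 0
    for y in ranked_labels:
        if y == 1:
            # Every negative not yet seen is ranked below this positive.
            wins += n_neg - negs_seen
        else:
            negs_seen += 1
    return wins / (n_pos * n_neg)


def average_precision(ranked_labels):
    """Mean of precision@k taken at the rank k of each positive."""
    hits = 0
    total = 0.0
    for rank, y in enumerate(ranked_labels, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical rankings of the same 10 examples (2 positives, 8 negatives),
# listed from highest score to lowest.
model_a = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
model_b = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]

for name, ranking in [("A", model_a), ("B", model_b)]:
    print(name, "AUC:", roc_auc(ranking), "AP:", round(average_precision(ranking), 3))
```

In this toy case model B has the higher AUC while model A has the higher average precision, so the two metrics can rank the same pair of models in opposite orders.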

No, the point is that for an imbalanced set you literally don't care about model performance in the region where the false-positive rate is substantial. For example, say 1% of your data are true hits and you run the classifier at an FPR of 5%: even if it catches every true hit, you are generating ~5 false positives per true hit, which is insane to do!

That's why most of the ROC curve is useless for imbalanced sets, and why I prefer precision/recall graphs, as does the OP.
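
The arithmetic behind the 1% / 5% example can be checked with a short sketch. The function names and the `recall` parameter are assumptions added for illustration; the comment only gives the prevalence and the FPR, so perfect recall is assumed as the best case for the classifier.

```python
def false_positives_per_true_positive(prevalence, fpr, recall=1.0):
    """Expected false positives per true positive at an operating point.

    prevalence: fraction of examples that are true hits
    fpr:        false-positive rate (fraction of negatives flagged)
    recall:     fraction of true hits flagged (assumed 1.0 by default)
    """
    false_pos = (1.0 - prevalence) * fpr
    true_pos = prevalence * recall
    return false_pos / true_pos


def precision(prevalence, fpr, recall=1.0):
    """Fraction of flagged examples that are true hits."""
    false_pos = (1.0 - prevalence) * fpr
    true_pos = prevalence * recall
    return true_pos / (true_pos + false_pos)


# The numbers from the comment: 1% true hits, FPR of 5%.
ratio = false_positives_per_true_positive(prevalence=0.01, fpr=0.05)
prec = precision(prevalence=0.01, fpr=0.05)
print(f"false positives per true hit: {ratio:.2f}")
print(f"precision: {prec:.3f}")
```

Under these assumptions the ratio comes out just under 5 false positives per true hit, which corresponds to a precision of roughly 17%. Any recall below 1.0 makes the ratio worse.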