| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by perturbation 2603 days ago

> * They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

A̶F̶A̶I̶K̶,̶ ̶a̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶c̶a̶n̶ ̶b̶e̶ ̶m̶i̶s̶l̶e̶a̶d̶i̶n̶g̶ ̶f̶o̶r̶ ̶a̶n̶ ̶i̶m̶b̶a̶l̶a̶n̶c̶e̶d̶ ̶d̶a̶t̶a̶s̶e̶t̶,̶ ̶b̶u̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶o̶k̶a̶y̶ ̶f̶o̶r̶ ̶s̶e̶l̶e̶c̶t̶i̶n̶g̶ ̶m̶o̶d̶e̶l̶s̶.̶ Edit: This is incorrect, a PR curve + PR AUC should be used for model selection if imbalanced. I agree it would be really misleading if they (say) just reported accuracy (since the null classifier of always guess negative would give 80% overall accuracy). I̶ ̶t̶h̶o̶u̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶f̶o̶r̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶s̶h̶o̶u̶l̶d̶ ̶s̶t̶i̶l̶l̶ ̶b̶e̶ ̶a̶ ̶v̶a̶l̶i̶d̶ ̶m̶e̶a̶s̶u̶r̶e̶ ̶s̶i̶n̶c̶e̶ ̶i̶t̶'̶s̶ ̶s̶h̶o̶w̶i̶n̶g̶ ̶h̶o̶w̶ ̶m̶u̶c̶h̶ ̶b̶e̶t̶t̶e̶r̶ ̶t̶h̶e̶ ̶m̶o̶d̶e̶l̶ ̶p̶e̶r̶f̶o̶r̶m̶s̶ ̶t̶h̶a̶n̶ ̶r̶a̶n̶d̶o̶m̶ ̶g̶u̶e̶s̶s̶i̶n̶g̶.̶

How do you usually handle imbalanced data? I've had some success with SMOTE or weighted loss for imbalanced datasets, but I'm embarrassed to say I've been using AUC with ROC curves as the default - if this gives inferior model selection than AUC with PR curve I'll have to start doing that instead.