by claytonjy
2611 days ago
Why would you ever balance your test data? If 80/20 is the actual population distribution, the sample that forms your test set should conform to that. Balance all you want in train/validation sets, but never the test set. Not balancing and using ROC is a terrible combo, but the metric is the problem, not the lack of artificial balance.
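To make the point about the metric concrete, here is a rough sketch. The 90% sensitivity and 10% false-positive rate are made-up numbers for a hypothetical classifier, not figures from the study being discussed, and `precision` is just an illustrative helper. Sensitivity and false-positive rate are the two axes of a ROC curve and do not depend on the class balance, so the ROC looks the same at every prevalence, while precision (via Bayes' rule) changes a lot:

```python
def precision(tpr, fpr, prevalence):
    """Positive predictive value from Bayes' rule.

    tpr: true-positive rate (sensitivity)
    fpr: false-positive rate (1 - specificity)
    prevalence: fraction of positives in the test population
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point (assumed values, for illustration only).
TPR, FPR = 0.90, 0.10

# Same point on the ROC curve, three different class balances.
for prev in (0.5, 0.2, 0.1):
    print(f"prevalence={prev:.0%}  precision={precision(TPR, FPR, prev):.3f}")
```

Under those assumed numbers, precision goes from 0.900 on a balanced set to about 0.692 at 80/20 and 0.500 at 90/10, even though the ROC operating point never moves.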
|
The imbalance is totally artificial and objectionable, though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance, they should make it reflect the actual statistics of the task that the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).
Admittedly, the correct analysis for unbalanced data is more annoying, and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.
Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine, and it isn't easy to correct: you can adjust how you read the machine's results against a baseline, but you can't undo whatever biases the humans had.