by claytonjy 2611 days ago
Why would you ever balance your test data? If 80/20 is the actual population distribution, the sample that forms your test set should conform to that. Balance all you want in train/validation sets, but never the test set.

Not balancing and using ROC is a terrible combo, but the metric is the problem, not the lack of artificial balance.

4 comments

I agree, they should do one or the other.

The imbalance is totally artificial and objectionable though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance, they should make it reflect the actual statistics of the task that the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).

Admittedly, the correct analysis for when the data is unbalanced is more annoying, and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.

Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine and that isn't easy to correct (you can correct what you think about the machine results with respect to a baseline, but not what biases the humans had).

> Where's the evidence that doctors see an 80/20 split in real life?

Cause they definitely don’t. Even in a select subpopulation - say, people going to a derm for screening - you’d expect one melanoma per 620 persons screened (as per the SCREEN trial). Since most people have more than one mole for evaluation, and even those with melanoma will have multiple innocent moles... a mole count >50 triggers a referral for screening, though in more cautious docs, possibly as few as 25...

If you wanna be really generous and consider our hypothetical high-risk group to have an average of 10 moles per person, that's 620 × 10 = 6,200 moles per melanoma, so roughly 6200:1, not 80:20.
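For anyone who wants to check the back-of-the-envelope numbers, here is a minimal sketch of that arithmetic. The function names are made up for illustration, and it assumes exactly one melanoma per affected person and a flat 10 moles per person, which are the generous assumptions from the comment above, not measured values.

```python
def lesions_per_melanoma(persons_per_melanoma: int, moles_per_person: int) -> int:
    """Total moles examined for every one melanoma found (hypothetical helper)."""
    return persons_per_melanoma * moles_per_person


def benign_to_malignant(persons_per_melanoma: int, moles_per_person: int) -> int:
    """Benign moles per melanoma, assuming exactly one melanoma per affected person."""
    return lesions_per_melanoma(persons_per_melanoma, moles_per_person) - 1


# 1 melanoma per 620 persons screened, assumed average of 10 moles each
total = lesions_per_melanoma(620, 10)
benign = benign_to_malignant(620, 10)
screening_prevalence = 1 / total      # fraction of moles that are melanoma
test_set_prevalence = 20 / (80 + 20)  # the 80/20 split under discussion

print(f"{benign}:1 benign to malignant")
print(f"screening prevalence ~ {screening_prevalence:.5f}")
print(f"80/20 test set prevalence = {test_set_prevalence:.2f}")
```

Under these assumptions the positive class is more than a thousand times rarer in screening than in an 80/20 test set.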

Another reason to balance the test set when the train set is unbalanced is to check if lack of training data for certain classes is a problem. You would use cross-validation, but do different splits for each class. It might well turn out that certain classes are just "easy", and you don't need to find more training samples for them to get the overall accuracy up.
80/20 is not the actual population distribution though.
Do you have an explanation of why ROC is bad for unbalanced datasets? Isn't ROC unaffected by dataset imbalance?
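One way to look at this question is with a toy example. ROC AUC really is unaffected by class balance, since it only depends on how positives rank against negatives. That same invariance means it can look fine while precision collapses once negatives vastly outnumber positives. The scores and helper names below are invented for illustration, and the imbalance is simulated by simply repeating the negative scores.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a random negative."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def precision_at(pos_scores, neg_scores, threshold):
    """Fraction of items scored at or above the threshold that are true positives."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up classifier scores: 4 positives, 4 negatives (balanced)
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.3, 0.2, 0.1]

# Same score distribution, but 100x as many negatives (heavily imbalanced)
neg_imbalanced = neg * 100

print(roc_auc(pos, neg), roc_auc(pos, neg_imbalanced))
print(precision_at(pos, neg, 0.5), precision_at(pos, neg_imbalanced, 0.5))
```

In this sketch the AUC is identical in both settings, while precision at a 0.5 threshold drops from 0.75 to about 0.03. That gap is the usual argument for reporting precision-recall style metrics alongside ROC when the test set is imbalanced.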
Agreed, I have a hard time believing this person does CV research (though I suppose it could just be a hobby for them) with a statement like that. Especially calling out that they didn't balance the test set, ummm... what?