

by 0xab 2611 days ago

I agree, they should do one or the other.

The imbalance is totally artificial and objectionable though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance, they should make it reflect the actual statistics of the task that the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).

Admittedly, the correct analysis for when the data is unbalanced is more annoying and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.
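To see why the unbalanced analysis is more annoying, here is a minimal sketch of how precision moves with the class split while sensitivity and specificity (the quantities behind a ROC curve) stay fixed. The 0.9 sensitivity and 0.9 specificity are made-up illustrative values, not numbers from the study under discussion, and `precision_at_prevalence` is a hypothetical helper name.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision (positive predictive value) for a classifier with fixed
    sensitivity and specificity, evaluated at a given positive-class rate."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: 90% sensitivity, 90% specificity (assumed values).
for prevalence in (0.5, 0.2, 0.1):
    ppv = precision_at_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.1f}  precision={ppv:.3f}")
```

With those assumed values the same classifier scores a precision of 0.900 on a balanced set, about 0.692 on an 80/20 set, and 0.500 on a 90/10 set, so any precision-style number only means something relative to the split it was measured on.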

Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine and that isn't easy to correct (you can correct what you think about the machine results with respect to a baseline, but not what biases the humans had).

2 comments

arkades 2611 days ago

> Where's the evidence that doctors see a 80/20 split in real life?

Cause they definitely don’t. Even in a select subpopulation - say, people going to a derm for screening - you’d expect one melanoma per 620 persons screened (as per the SCREEN trial). Since most people have more than one mole for evaluation, and even those with melanoma will have multiple innocent moles... a mole count >50 triggers a referral for screening, though in more cautious docs, possibly as few as 25...

If you wanna be really generous and consider our hypothetical high risk group to have an average of 10 moles per person, that’s roughly 6200:1 (6200 moles per melanoma), not 80:20.
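Working through the figures quoted in this comment (one melanoma per 620 persons screened, and an assumed average of 10 moles per person) gives the following back-of-the-envelope numbers. This is only a sketch of the arithmetic; the variable names are illustrative and the inputs are taken as stated in the comment, not independently checked.

```python
# Figures as quoted in the comment above.
persons_per_melanoma = 620  # one melanoma per 620 persons screened
moles_per_person = 10       # the comment's "generous" assumption

total_moles = persons_per_melanoma * moles_per_person  # moles examined per melanoma
benign_per_melanoma = total_moles - 1                  # everything except the one melanoma
prevalence = 1 / total_moles                           # per-mole melanoma rate

# How much an 80/20 test set over-represents the positive class.
inflation = 0.20 / prevalence

print(f"benign : melanoma = {benign_per_melanoma}:1")
print(f"per-mole prevalence = {prevalence:.5f}")
print(f"80/20 split inflates positives by about {inflation:.0f}x")
```

Under those assumptions a 20% positive rate is about 1240 times the per-mole rate a screening dermatologist would actually face.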


p1esk 2611 days ago

Another reason to balance the test set when the train set is unbalanced is to check if lack of training data for certain classes is a problem. You would use cross-validation, but do different splits for each class. It might well turn out that certain classes are just "easy", and you don't need to find more training samples for them to get the overall accuracy up.
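One possible reading of "different splits for each class" is to build the folds class by class, so every fold holds a fixed share of each class, and then look at recall per class rather than overall accuracy. The sketch below assumes that reading; `stratified_folds` and `per_class_recall` are hypothetical helper names written in plain Python, not functions from any particular library.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds, dealing out each class's
    samples round-robin so every fold gets a similar share of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy example: 8 samples of class "a", 2 of class "b".
labels = ["a"] * 8 + ["b"] * 2
folds = stratified_folds(labels, k=2)
print(folds)

# Made-up predictions, just to show the per-class breakdown.
print(per_class_recall(["a", "a", "b", "b"], ["a", "a", "b", "a"]))
```

A class whose recall is already high across folds with few training samples is a candidate for "easy", while a class whose recall lags is where more training data is likely to pay off.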
