Hacker News
by shadowgovt 1744 days ago
One major risk I see is that the amount of training data isn't the same across races. For white vs. black patient data, there's between a 2:1 and 3:1 imbalance in both the training and test data (and a much higher imbalance for Asian patients, as high as 20:1 in some of these categories).

This gives the CNN more information on one race than another, which can create a classifier that performs very well on the training and test data it has access to but then flakes spectacularly on data outside the training set (because the source isn't representative of the total variance in the global population).
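
To make the concern concrete, here's a rough sketch (toy numbers I made up, not data from the study) of how an aggregate accuracy figure can hide poor performance on an under-represented group. The function names and the 20:2 split are purely illustrative assumptions.

```python
from collections import defaultdict

def per_group_accuracy(labels, preds, groups):
    """Return accuracy computed separately for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p, g in zip(labels, preds, groups):
        total[g] += 1
        correct[g] += int(y == p)
    return {g: correct[g] / total[g] for g in total}

def overall_accuracy(labels, preds):
    """Return accuracy over all samples, ignoring group membership."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)

# Made-up toy data: 20 samples from group "A", only 2 from group "B".
groups = ["A"] * 20 + ["B"] * 2
labels = [1] * 22
# Hypothetical classifier: right on every "A" sample, wrong on every "B" sample.
preds = [1] * 20 + [0] * 2

print(overall_accuracy(labels, preds))            # dominated by group "A"
print(per_group_accuracy(labels, preds, groups))  # exposes the gap for "B"
```

The headline number looks fine because the majority group dominates it, which is why you'd want per-group numbers (and external data) before trusting the result.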

1 comment

They tested on tons of different external datasets, and at least one of the training datasets was balanced. Same results were obtained.
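
For anyone wondering what "balanced" can mean in practice, one common approach is downsampling every group to the size of the smallest one. This is only a sketch of that generic technique; I'm not claiming it's how the authors built their balanced dataset, and `downsample_to_balance` and `group_of` are names I made up.

```python
import random
from collections import defaultdict

def downsample_to_balance(samples, group_of, seed=0):
    """Randomly keep the same number of samples from every group.

    group_of is a caller-supplied function mapping a sample to its group.
    Assumes samples is non-empty.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_group = defaultdict(list)
    for s in samples:
        by_group[group_of(s)].append(s)
    # Every group is cut down to the size of the smallest group.
    n = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(rng.sample(members, n))
    return balanced

# Made-up toy data with a 10:1 imbalance between groups "A" and "B".
samples = [("A", i) for i in range(20)] + [("B", i) for i in range(2)]
balanced = downsample_to_balance(samples, group_of=lambda s: s[0])
print(len(balanced))
```

The trade-off is that you throw away most of the majority-group data, so it's a check on the imbalance explanation rather than a free fix.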