

by eklitzke 1493 days ago

Let's say you're trying to train a model to predict whether a patient has a cancerous tumor based on some imaging data. You have a data set for this that includes images from people with tumors and people without, from all races. However, unbeknownst to you, most of the images from people of race X had tumors and most of the images from people of race Y did not.

If the AI is also implicitly learning to detect race from the images, it's going to learn an association that people of race X usually have tumors and people of race Y usually do not.

The problem here is that the people training the model and the clinical radiologists interpreting its output may not realize that race was a confounding factor in training, so they'll be unaware that the model may be making racial inferences on real-world data.

If people of race X really do have a higher incidence rate for a specific type of cancer than race Y, maybe this is OK. But if the issue is that there was bias in the training/validation data set that was unknown to the people building the model, and in the real world people of race X and race Y have exactly the same incidence rate for this type of cancer, then this is going to be a problem because it's likely to introduce race-specific errors.
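To make that concrete, here's a rough toy simulation of the scenario. Everything in it is made up for illustration: the 80%/20% tumor rates in the biased training set, the 30% real-world rate, the one-number "signal" standing in for the image, and the `group` feature standing in for whatever the model implicitly picks up about race. It's a sketch of the failure mode, not a claim about any real imaging model.

```python
import math
import random

def make_patients(n, p_tumor_x, p_tumor_y, rng):
    """Return (features, label, group) rows; group 1 = race X, 0 = race Y."""
    rows = []
    for _ in range(n):
        group = rng.randint(0, 1)
        p_tumor = p_tumor_x if group == 1 else p_tumor_y
        tumor = 1 if rng.random() < p_tumor else 0
        # Noisy "imaging" signal that genuinely depends on the tumor.
        signal = rng.gauss(1.0 if tumor else 0.0, 1.0)
        # The group feature is a stand-in for race cues the model infers.
        rows.append(((signal, float(group)), tumor, group))
    return rows

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, epochs=200, lr=0.5):
    """Plain logistic regression fitted with full-batch gradient descent."""
    w, b, n = [0.0, 0.0], 0.0, len(rows)
    for _ in range(epochs):
        gw, gb = [0.0, 0.0], 0.0
        for x, y, _ in rows:
            err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

def false_positive_rate(rows, w, b, group):
    """Share of tumor-free patients in `group` that the model flags."""
    healthy = [x for x, y, g in rows if g == group and y == 0]
    flagged = sum(1 for x in healthy if w[0] * x[0] + w[1] * x[1] + b > 0)
    return flagged / len(healthy)

rng = random.Random(0)
# Biased training set: X mostly has tumors, Y mostly does not.
train_rows = make_patients(1000, p_tumor_x=0.8, p_tumor_y=0.2, rng=rng)
# "Real world": both groups have the same incidence.
test_rows = make_patients(2000, p_tumor_x=0.3, p_tumor_y=0.3, rng=rng)

w, b = train(train_rows)
fpr_x = false_positive_rate(test_rows, w, b, group=1)
fpr_y = false_positive_rate(test_rows, w, b, group=0)
print("weight on group feature:", w[1])
print("false positive rate, race X:", fpr_x)
print("false positive rate, race Y:", fpr_y)
```

The model ends up putting a positive weight on the group feature, so on the unbiased test set healthy patients of race X get flagged far more often than healthy patients of race Y, even though the true incidence is identical. Those are the race-specific errors described above.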