by bravura 1114 days ago
A data set is biased when it doesn't reflect the true underlying distribution found in nature.

So a face corpus with only white faces doesn't reflect the diversity of faces one encounters in the world.

With that said, debiasing data is extremely difficult because the true distribution of things is unknown and sometimes subjective. The images you would see as a human from birth to death, growing up in a first-world country, would be very different from those captured by a drone's video camera. Are we really sure that ImageNet should be K% animals and not K/2% animals? And if you train a machine learning algorithm on every possible image, with every possible combination of pixel values, it will just learn noise.
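
To make the K% vs. K/2% point concrete, here is a rough Python sketch of reweighting a labeled data set toward a chosen target distribution. Everything in it is an assumption for illustration: the helper name importance_weights, the two-class toy labels, and both target distributions are made up, and nothing here comes from ImageNet. The point is only that the "correction" depends entirely on which target you decide is the true one.

```python
from collections import Counter

def importance_weights(labels, target):
    """Return per-class weights = target share / observed share.

    `target` is an assumed "true" class distribution; picking it is
    exactly the subjective step discussed above.
    """
    counts = Counter(labels)
    n = len(labels)
    weights = {}
    for cls, share in target.items():
        observed = counts.get(cls, 0) / n
        if observed == 0:
            # A class that never appears cannot be fixed by reweighting.
            raise ValueError(f"class {cls!r} is absent from the data")
        weights[cls] = share / observed
    return weights

# Hypothetical toy corpus: 60% animals, 40% objects.
labels = ["animal"] * 60 + ["object"] * 40

# Two equally defensible guesses at the "true" distribution.
w_a = importance_weights(labels, {"animal": 0.40, "object": 0.60})
w_b = importance_weights(labels, {"animal": 0.20, "object": 0.80})

print(w_a)
print(w_b)
```

The same data gets two different sets of weights depending on the assumed target, and nothing in the data itself tells you which guess is right.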