by gruez 484 days ago
Is it a "statistical bias" if it reflects the underlying data? Is it "bias" to generate mostly male lumberjacks, even though most are male?

Yes. The term of art for this is "demographic bias", and it's exactly what you describe: the underlying population is itself skewed for or against some demographic.

An ML image generator designed to repaint someone as a lumberjack should work equally well for all users, regardless of real-world demographics. So the training dataset needs to correct for this demographic bias if the model is not to overfit to the majority group.

This isn't some recent "woke" phenomenon; it has been a known issue in large ML projects for at least a decade, if not longer.

If you are training a model to respond to automated test failures, you don't want to sample real-world test data in proportion to the actual results, because most automated tests pass and the model would see very few failures. This is also demographic bias, and how you handle it depends on what you want the model to learn.
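
To make the test-failure case concrete, here is a rough sketch of one common way to handle it: draw the same number of examples from each outcome instead of sampling in proportion. The `balanced_sample` helper, the `status` field, and the 95/5 pass/fail split are all made up for illustration. Equal-sized groups are only one option, and the right ratio depends on what you want the model to learn.

```python
import random
from collections import defaultdict

def balanced_sample(records, label_of, per_class, seed=0):
    """Draw exactly per_class records from each label group (hypothetical helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[label_of(rec)].append(rec)

    sample = []
    for members in groups.values():
        if len(members) >= per_class:
            # Common class: downsample without replacement.
            sample.extend(rng.sample(members, per_class))
        else:
            # Rare class: oversample with replacement to reach per_class.
            sample.extend(rng.choices(members, k=per_class))
    rng.shuffle(sample)
    return sample

# Made-up data: 100 test runs, of which 5 fail and 95 pass.
runs = [{"id": i, "status": "fail" if i % 20 == 0 else "pass"} for i in range(100)]
batch = balanced_sample(runs, lambda r: r["status"], per_class=10)
```

Here `batch` holds 10 passing and 10 failing runs even though failures are only 5% of the raw data. Oversampling with replacement repeats the same few failures, so in practice you might prefer loss weighting or collecting more failing runs.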