Hacker News
by prododev 484 days ago
> It's a statistical model, and of course there are more black rappers and white investment bankers

Yes, this is what the author is pointing out - there's a statistical bias in the dataset that is showing in the results.

1 comments

Is it a "statistical bias" if it reflects the underlying data? Is it "bias" to generate mostly male lumberjacks, given that most lumberjacks are male?
Yes. The term of art for this is "demographic bias", and it's exactly what you describe -- the population itself is skewed for or against some demographic.

An ML image generator designed to repaint someone as a lumberjack should work equally well for all users, regardless of real-world demographics. So the training dataset needs to correct for this demographic bias if the model is to avoid overfitting to the majority group.
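
To make "account for this" concrete, here is a minimal sketch of one common approach, inverse-frequency weighting. The function name and the toy group labels are made up for illustration, and whether weighting like this is the right fix depends on the model and the task.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Return one weight per example so every group has the same total weight.

    `groups` is a list of group labels, one per training example
    (hypothetical labels, for illustration only).
    """
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    # Rare groups get large weights, common groups get small ones.
    return [total / (n_groups * counts[g]) for g in groups]

# Toy dataset: 9 examples from one group, 1 from another.
groups = ["male"] * 9 + ["female"] * 1
weights = inverse_frequency_weights(groups)
```

With these weights the nine majority examples together count as much as the single minority example, so a weighted loss no longer rewards the model for ignoring the smaller group.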

This isn't some recent "woke" phenomenon; it has been a known issue in large ML projects for at least a decade, if not longer.

If you are training a model to respond to automated test failures, you don't want to sample real-world test data in proportion to actual test results, because most automated tests pass. This is also demographic bias, and how you handle it depends on what you want the model to learn.
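
As a rough sketch of the test-failure case, assuming the goal is an even mix of passes and failures, you could downsample the majority class before training. The record layout, the `outcome` key and the function name are hypothetical, and downsampling is only one option (oversampling or loss weighting are others).

```python
import random

def balance_by_downsampling(records, label_key="outcome", seed=0):
    """Return a shuffled subset with equally many records per label.

    Each label is downsampled to the size of the rarest label.
    """
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = {}
    for record in records:
        by_label.setdefault(record[label_key], []).append(record)
    smallest = min(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rng.sample(rows, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data: 95 passing runs and 5 failing runs (made-up numbers).
records = (
    [{"test": f"t{i}", "outcome": "pass"} for i in range(95)]
    + [{"test": f"f{i}", "outcome": "fail"} for i in range(5)]
)
balanced = balance_by_downsampling(records)
```

Here the balanced set keeps all 5 failures and a random 5 of the 95 passes, so the model sees failures as often as passes instead of roughly 1 time in 20.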