We impart our bias on ML algorithms by choosing what data to train the AI on.

The problem, I think, is one of self-selection.

Consider two hypothetical social networking websites - Friendface and FaceSpace. Friendface's userbase is mostly white, while FaceSpace caters mostly to urban, black populations. And that would make sense too - you would only join a social network if your friends are on it. If you're white, chances are the majority of your friends are also white. And vice versa.

Now suppose Friendface is a lot more active on the ML front. The problem comes when Friendface releases its data: because they're more active in ML, and ML scientists love not having to collect their own data, more and more models get trained on, and optimized for, Friendface data. Apparent "structural" racism happens. Tumblrinas all pounce on it as if it were the biggest oppressive struggle of their lives.

A very cute thing to imagine in this scenario: FaceSpace suddenly gets good at NLP and open-sources its statistical language model. Recall that FaceSpace users are more likely to use AAVE in their communication, so what do you think that statistical language model would look like?
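
To make the thought experiment concrete, here is a rough sketch of how a statistical language model inherits its training population. It assumes the simplest possible model (an add-one-smoothed unigram model), and the three tiny corpora are made-up stand-ins that differ only in word choice, not real data from anywhere. The names `train_unigram` and `perplexity` are mine, purely for illustration.

```python
import math
from collections import Counter

def train_unigram(sentences):
    """Count word frequencies over whitespace-tokenized sentences."""
    return Counter(word for s in sentences for word in s.split())

def perplexity(counts, vocab_size, sentences):
    """Per-word perplexity under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    log_prob = 0.0
    n_words = 0
    for s in sentences:
        for word in s.split():
            # Add-one smoothing: unseen words still get a small probability.
            p = (counts[word] + 1) / (total + vocab_size)
            log_prob += math.log(p)
            n_words += 1
    return math.exp(-log_prob / n_words)

# Hypothetical toy corpora (assumptions, not real user data).
friendface_train = [
    "the movie was really good",
    "the game was really good",
    "the food was very nice",
]
friendface_test = ["the movie was very good"]
facespace_test = ["this film is proper brilliant"]

# The vocabulary covers every word the model could be asked to score.
vocab = {
    w
    for s in friendface_train + friendface_test + facespace_test
    for w in s.split()
}

model = train_unigram(friendface_train)
ppl_friendface = perplexity(model, len(vocab), friendface_test)
ppl_facespace = perplexity(model, len(vocab), facespace_test)

print(f"perplexity on Friendface-style text: {ppl_friendface:.1f}")
print(f"perplexity on FaceSpace-style text:  {ppl_facespace:.1f}")
```

The model trained only on Friendface-style text finds the other community's text far more "surprising", even though nothing about that text is wrong. Swap which site releases the data and the surprise flips the other way.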

In the original article, Maciej mentions "going to the community" - using crowd wisdom to handle these sorts of things, and preferring open standards to siloed ones (like writing your blog post on Facebook... why??!!). While that sounds like a good idea, as I mentioned in my other comment, it also sounds tiring as hell.

Firms act rationally (more or less)... ML is driven by huge companies with huge datasets. Why would they need to prune external datasets when they could just do their ML research with a few SQL queries?