by agentcoops 3571 days ago
The lives of members of different social groups can differ for historical reasons. Because of this, the best choice of features (as in, how we select and encode the relevant aspects of a dataset for a particular problem), as well as the correlations between them, may differ from group to group. The question is not whether there is a fundamental difference between, in this case, racial groups when it comes to paying back loans (i.e. that the only feature your model needs is "is member of this group"), but which life circumstances have, for whatever reason, left a quantifiable trace in our databases, and how those traces are distributed within each group.

One hypothetical example: suppose there existed a group G whose members were not able to attend the top n% of universities due to discrimination. Your company uses some ranking of the university attended as one of the features fed to its favorite machine learning algorithm. However, the dataset you trained on excluded group G. Within this group, the best university anyone has been able to attend is X, which is by definition not in the top n%. Had the algorithm been trained on this group, it would have observed that school X is highly correlated with success within the group, even though it is not in the original training set. As is, your ML system assigns a low probability of success to members of group G.
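To make the hypothetical concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the counts, the "top"/"other" tiers, and the choice of a plain per-school repayment frequency as the "model". It only shows the shape of the problem, not how any real lender scores applicants.

```python
from collections import defaultdict

def fit_rates(rows):
    """Estimate P(repaid | school) as a plain per-school frequency."""
    counts = defaultdict(lambda: [0, 0])  # school -> [repaid, total]
    for school, repaid in rows:
        counts[school][0] += repaid
        counts[school][1] += 1
    return {school: r / n for school, (r, n) in counts.items()}

def score(model, school):
    # A school never seen in training is treated as a generic "other" school.
    return model.get(school, model["other"])

# Hypothetical training data that excludes group G: (school, repaid).
majority = ([("top", 1)] * 90 + [("top", 0)] * 10
            + [("other", 1)] * 40 + [("other", 0)] * 60)

# Hypothetical group G: barred from "top" schools, so its strongest
# applicants attend school X instead.
group_g = ([("X", 1)] * 85 + [("X", 0)] * 15
           + [("other", 1)] * 40 + [("other", 0)] * 60)

model_without_g = fit_rates(majority)
model_with_g = fit_rates(majority + group_g)

print(score(model_without_g, "X"))  # 0.4: X looks like any non-top school
print(score(model_with_g, "X"))     # 0.85: X is a strong signal within G
```

The model trained without group G has no way to learn that X carries information, so a graduate of X is scored like anyone from a non-top school.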

Issues like this will be hard to prevent. While that doesn't mean we shouldn't work hard to make real innovations in ML, I think the legal approach of a "right to explanation", as analyzed in http://arxiv.org/pdf/1606.08813v3.pdf and recently added to European law, is a helpful tool for ensuring accountability regardless.

1 comments

Yes, if your training data excludes relevant features then you can't use them. No one disputes this.

However, once you start including such people in your training data, these issues are not hard to prevent. In fact, ML systems will often do this accidentally even when you don't want them to (when the sign of the bias has the politically incorrect direction). It's called redundant encoding.

See the section of my blog post "What if we scrub race, but redundantly encode it?" where I do calculations to show the effect of this: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

In short, if your data is biased against a group, but you include group membership either directly or via redundant encoding, your algorithm will fix the bias as best it can.
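A toy illustration of that claim, as a sketch rather than the calculation from the blog post: the counts, the "low" score flag, and the zip codes below are all invented, and the "model" is just a repayment frequency per feature combination. The group label is never given to the model; zip code stands in as the redundant encoding.

```python
from collections import defaultdict

def fit(rows, key):
    """Estimate P(repaid | key(row)) as a plain frequency table."""
    counts = defaultdict(lambda: [0, 0])  # feature value -> [repaid, total]
    for row in rows:
        k = key(row)
        counts[k][0] += row["repaid"]
        counts[k][1] += 1
    return {k: r / n for k, (r, n) in counts.items()}

def make(group, zip_code, n, n_repaid):
    # n hypothetical applicants, all flagged "low", of whom n_repaid repay.
    return [{"group": group, "zip": zip_code, "flag": "low",
             "repaid": int(i < n_repaid)} for i in range(n)]

# The "low" flag is biased against group G: flagged members of G repay 70%
# of the time, flagged non-members only 30%. Zip B is 90% group G.
rows = (make("other", "A", 90, 27) + make("other", "B", 10, 3)
        + make("G", "A", 10, 7) + make("G", "B", 90, 63))

flag_only = fit(rows, key=lambda r: r["flag"])
flag_and_zip = fit(rows, key=lambda r: (r["flag"], r["zip"]))

print(flag_only["low"])            # 0.5: pooled rate, underrates group G
print(flag_and_zip[("low", "B")])  # 0.66: moves toward G's true 0.7
print(flag_and_zip[("low", "A")])  # 0.34: moves toward the others' 0.3
```

Because zip code is only an imperfect proxy here, the correction is partial, which is the "as best it can" part.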

The entire purpose of machine learning is to discover hidden features and correlations in messy data, so I fail to see why this is considered surprising.

I generally consider the "right to explanation" to be a fairly transparent attempt by the EU to keep American tech companies out of Europe. The entire purpose of ML is that it can uncover true facts that humans can't. The right to explanation is just an attempt to hobble this power, probably because few Euro companies can do it.