Hacker News
by rocqua 1533 days ago
This is getting off-topic. But it still is.

I'll stipulate that in this scenario the data themselves are not actually biased, even though in reality such data are often biased by things like disproportionate policing and heavy punishment.

Even then, government discrimination based on race is bad, even when a statistical basis for such discrimination exists. What makes "race" problematic to discriminate on is how easy it is to see "race". Or rather, how easily most people classify and distinguish between ethnicities based on how they look.

For an example, let's start with a small difference between blue and green people. Blue people are twice as likely to commit crime as green people, with a criminality rate of 0.2% vs 0.1%. If this starts being how you police, if this 2x difference starts guiding decisions, then a lot of innocent people start being disadvantaged: at a 0.2% rate, 99.8% of blue people are not criminals. The extra problem is that it is very easy to see if someone is blue or green, so it becomes really easy for a lot of people to start acting on this 2x difference. This harms all blue people, which is disproportionate. It then becomes a lot easier to get to the feedback loop you talked about.
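
To put numbers on that, here is a rough sketch of the arithmetic. The 0.2% and 0.1% rates are the ones from the example; the group size of 100,000 is purely an assumption to make the counts readable, and `breakdown` is just a helper name made up for illustration.

```python
from fractions import Fraction

# Rates from the example: 0.2% for blue people, 0.1% for green people.
CRIME_RATE = {"blue": Fraction(2, 1000), "green": Fraction(1, 1000)}

# Hypothetical group size, chosen only to make the counts easy to read.
GROUP_SIZE = 100_000


def breakdown(group: str) -> dict:
    """Return offender and innocent counts for one group."""
    offenders = int(CRIME_RATE[group] * GROUP_SIZE)
    innocents = GROUP_SIZE - offenders
    return {
        "offenders": offenders,
        "innocents": innocents,
        "innocent_share": Fraction(innocents, GROUP_SIZE),
    }


blue = breakdown("blue")
green = breakdown("green")

# Relative risk between the groups versus the absolute gap in innocence.
relative_risk = CRIME_RATE["blue"] / CRIME_RATE["green"]
innocence_gap = green["innocent_share"] - blue["innocent_share"]

for name, stats in (("blue", blue), ("green", green)):
    print(
        f"{name}: {stats['offenders']} offenders, "
        f"{stats['innocents']} innocents "
        f"({float(stats['innocent_share']):.1%} innocent)"
    )
print(f"relative risk: {float(relative_risk):.0f}x")
print(f"gap in innocent share: {float(innocence_gap):.1%}")
```

With these assumed sizes, the 2x relative risk comes down to 200 versus 100 offenders per 100,000 people, while 99,800 blue people are innocent. Acting on colour alone puts all 99,800 of them under suspicion in order to reach the 200.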

1 comment

Yeah, I agree. That's fair. I've thought about this before - I worked at a bank a couple of years ago, and our CCO (Chief Credit Officer in this instance) wanted to implement a rule, in our credit decisioning models, to decline people whose surnames had 5 or more vowels. It was a naked (and admitted) proxy for Africans, and probably some other 'ethnic' people too[0].

And it made me think: I suspect lots of our ML/NN models function like that. They pick race, or they pick a proxy for race. In situations where the 'ground truth' metric genuinely is racially skewed, it can be hard to tell, and it's just not realistic to demand that people make their models inaccurate for the sake of racial equity.

But it highlights, for me, the unavoidable danger of black-box models. I don't mean some logistic regression or decision tree, because those - while not literally explaining themselves - can be figured out if you have some domain knowledge of the parameters. But the overfitting machines that we call neural nets, well, I suspect this is happening everywhere, at a cost in both equity and accuracy/reliability. (The probably-apocryphal story comes to mind of the computer vision model that was meant to estimate the density of people in a train station, but ended up just looking at the clock on the wall.)

[0] I remember it exactly because it also would have captured me, incidentally, with my plummy double-barrelled surname - though that's beside the point here.