Hacker News new | ask | show | jobs
by eridius 3551 days ago
The problem is you're modeling a biased reality. And accurately modeling a biased reality may in many cases accentuate the bias. Take for example the previously-mentioned case of using an algorithm to determine where to focus your policing efforts. If the data you have says that more arrests are done in a particular part of the city, then you'll want to put more police there, right? But areas where there are more police will tend to see more arrests. So the fact that you're putting more police in an area where you see more arrests is just going to make the bias more extreme, causing even more arrests there. This causes a feedback loop. So you may be accurately modeling reality, but you're modeling a pre-existing bias and making it worse. And who knows why that pre-existing bias was even there? The fact that there were more arrests there may not be because that area actually has more crime committed, it could be due to other factors, such as racial profiling by police, and in that case your algorithm is now accidentally racist because it's perpetuating racial profiling.
2 comments

The problems are really twofold:

(1) Defining the proper goals, and

(2) Measuring the right things (such as the real goals of interest rather than biased proxies.)

With police deployments, you are assuming the solution (rather than letting your algorithm optimize it) by saying "I want to put more police where more arrests occur". What you really want is probably something more like (the exact goal may be different, of course) "I want to deploy police resources where it will most effectively reduce the incidence of crime, weighted by some assigned measure of severity." Then let your ML algorithm crunch the various measurable factors and produce an optimum deployment to do that.

(But, then again with that goal -- and similar problems exist with many likely real goals -- you run into the other problem, which is measuring the incidence of crime -- measuring crime reports may be the obvious approach, but there's plenty of evidence that lots of factors can bias crime reports, including communities having bad experience with police being less likely to report crimes.)

Thank you. This is so much clearer than what I was saying.

As you say, proper goals and measurement can fix a lot of these problems, and I don't think it's obvious that ml algorithms solve either of those

I directly addressed this critique two posts up. Why don't you go read that post?

https://news.ycombinator.com/item?id=12627359

I did read it, but you're talking about correcting for measurement biases in order to recover an accurate view of reality. But what I'm saying is that accurately measuring reality may in fact be how you get bias, because the very thing you're measuring may be biased. If you're aware the bias exists and have tools that can measure the bias itself then maybe you can correct for the bias, but you can't just expect your algorithm to automatically correct itself in the presence of bias because its goal is to model reality, not to figure out whether there's inherent bias in the thing it's modeling.
Here's my concrete claim. Let pp = police presence, then P(crime detected) = r(pp).

Measured crime = crimes x r(police presence).

As long as your model is expressive enough to capture r(pp), bias should be detected.

Fundamentally you are making the claim that there are certain types of variable correlations that are just so evil that no statistical model can possibly understand them. That's a very bold claim; it's essentially the claim that science doesn't work.

No, I'm claiming that P(crime detected) != r(pp). More police in an area typically means more crime is detected, but that's not the only factor. If you have two areas with identical police presences and identical actual crime rates (as opposed to reported crime rates), the rate of crime detection (as measured by arrests and whatnot) may be higher in one area due to other factors such as racial bias (not just racial profiling, but also things like police letting white people off with a warning where the equivalent black person would be arrested). So you cannot simply correct for this by accounting for the police presence.

What's more, your data may not even have the necessary info to figure out if there's a bias. For example, what if police are more likely to arrest someone wearing a red shirt than someone wearing any other color shirt? Unless the color of the person's shirt is part of the arrest report, there's no way your statistical model is going to figure out that red shirts affect arrest rate.

Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.

We've now established the existence of a statistical model which can detect this bias.

Now, any other model which is capable of expressing your specific r(p) can do the same thing. The entire purpose of fancy models like random forests is that they can express lots of functions while also being reasonably generalizable.

If you want to claim that this bias is much more difficult to encode in an SVM than all the other typical hidden patterns, you need to establish that your specific r(...) is somehow vastly more complicated than all the other things that machine learning models regularly detect. That's a pretty strong claim.

Interestingly, you are now arguing the exact opposite of what most "machine learning is racist" people claim. They typically claim machine learning is racist because algorithms actually learn hidden factors they wish it wouldn't; e.g., a lending algorithm might "redline" blacks who don't pay back their debts. I take it you believe this is highly unlikely, and algorithms can't possibly distinguish between men and women and then show high paying job ads to more men than women?

>Your function r = r(pp, red shirts, race of offender, etc) exists. A model of the form a x r + b x something_else + ... will detect the bias you've described, assuming of course the biasing variable is either present or redundantly encoded in the data set.

No no no. Had to respond to this because this such a common confusion (not to say that you personally have this).

That such a model exists within the class of models being used says absolutely nothing about whether the statistical/ML algorithm will find it with any degree of confidence from a sample. The science is still grappling with the question of how to do model selection. There are two, sort of, equivalent class of methods, regularization (this can be a regularization over the dependency structure too, not just a simple penalty) and prior. Its only when you get those right that you have decent chance of estimating well, from reasonable amount of data.

Short answer: universal approximation property of a class of models says nothing about learnability.

Regarding your last paragraph, there's two different angles here. The "machine learning is racist" angle I think is quite valid, but covers a different topic than what we've been discussing here. To be more specific, there's two different ways in which we can have racist models:

1. The algorithm is biased in a way that reflects reality but does not reflect how we wish it to behave. This is the "machine learning is racist" angle. A lending algorithm might quite rightly think that black people are a higher risk, but this is ethically problematic to act on, because denying loans to black people only serves to compound the social problem (even though it may make financial sense for your bank).

2. What I'm arguing is that we can have racist algorithms due to the fact that the data itself may be biased in a way you're not aware of. To take the red shirt example, something I forgot to say before was that if, say, a fad spreads among the black community of wearing red shirts, then you're going to see an uptick in arrests of black people, but your algorithm won't be able to figure out that this is actually due to arresting red-shirted people, which means it will believe that black people in general are more likely to be arrested.