| HN Mirror


by wrsh07 3551 days ago

This is very optimistic. There are well-known, documented cases of ML algorithm bias and its causes [1].

And it's not surprising that data itself contains some biases from the humans creating it. Suppose police are asking machine learning where more crime is committed - there will be a feedback loop. Where are they currently making more arrests? If they spend more time there, the bias will be exaggerated.

The op correctly gauges how we should be cautious. Your post, I'm afraid, is misleading at best.

[1] https://www.google.com/amp/s/www.technologyreview.com/s/6017...


yummyfajitas 3551 days ago

Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias.

The examples in the article you link to are not algorithmic bias at all. They consist of:

1) Humans at Facebook manipulating trending results.

2) Google's keyword algorithm (accurately) reflecting the fact that people with black names are more likely to have arrest records.

Let's distinguish "bias" from "accurately learning things you wish it wouldn't learn" or "accurately learning things you wish weren't true."

None of what I'm saying is remotely controversial. If I told you statistics could detect and correct bias in a mobile phone compass, you'd just think "cool stats bro". Is this article remotely controversial? https://www.chrisstucchio.com/blog/2016/bayesian_calibration...

The specific feedback loop you describe - variable detection probability => variable # of detections - can be directly mitigated. For a non-controversial example drawn from sensor networks (sensors report events with a delayed reaction; the longer you wait, the more events you detect), see here: https://www.chrisstucchio.com/blog/2016/delayed_reactions.ht...

(You can find similar examples all over the place. I just link to the ones I wrote because they spring immediately to mind.)
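
As a minimal sketch of that kind of correction, assume each area's detection probability is already known (in practice it has to be estimated, which is the hard part). The area names and numbers below are made up for illustration: the raw detection counts differ by roughly 3x, while dividing by the detection probability gives comparable estimates of the true event count.

```python
import random


def simulate_detections(true_events, detect_prob, rng):
    """Each true event is independently detected with probability detect_prob."""
    return sum(1 for _ in range(true_events) if rng.random() < detect_prob)


def corrected_estimate(detections, detect_prob):
    """Inverse-probability estimate of the true event count."""
    return detections / detect_prob


rng = random.Random(0)

# Hypothetical areas: identical true event counts, different detection probability.
areas = {"A": (10_000, 0.2), "B": (10_000, 0.6)}

raw = {}
estimates = {}
for name, (true_events, p) in areas.items():
    raw[name] = simulate_detections(true_events, p, rng)
    estimates[name] = corrected_estimate(raw[name], p)

for name in areas:
    print(name, "raw:", raw[name], "corrected:", round(estimates[name]))
```

The correction is only as good as the detection probabilities fed into it; if those are themselves wrong, the corrected counts inherit the error.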

In a compass, a sensor network, adtech or other quant finance, the idea that machine learning can fix biased inputs is not remotely controversial. The concept that statistics suddenly stops working to fix racism is just silly anthropomorphism.


wrsh07 3551 days ago

Aha - I think I see our miscommunication. When you say bias you mean statistical bias.

Yes, machine learning is able to correct for that kind of bias - 538's polls forecast is a good example of that.

But you don't get to redefine racial bias to be something innocuous. Yes, black names are more likely to have arrest records, but that "fact" is super misleading [1].

Finally, you're talking past me. I'm not saying that statistics is broken. I'm saying that we should be especially mindful of the OP's point when they say this:

> So what’s your data being fried in? These algorithms train on large collections that you know nothing about. Sites like Google operate on a scale hundreds of times bigger than anything in the humanities. Any irregularities in that training data end up infused into the classifier.

I think the OP author also has a related post about the kind of bias I'm talking about: http://idlewords.com/talks/sase_panel.htm

[1]: http://www.huffingtonpost.com/kim-farbota/black-crime-rates-...


yummyfajitas 3551 days ago

Without getting into a dispute about the definition of "bias", I'm saying that algorithms can accurately measure reality even if input(x=white, all else equal) != input(x=black, all else equal).

You are saying that algorithms are accurately measuring a reality you wish were different. I don't disagree with this.

The right thing to do is to actually answer unpleasant moral questions like "if blacks are 4x more likely to be dangerous criminals, what should we do about it?" But I guess overloading the word "bias" is a nice substitute for clearly thinking things through.


eridius 3551 days ago

The problem is you're modeling a biased reality. And accurately modeling a biased reality may in many cases accentuate the bias. Take for example the previously-mentioned case of using an algorithm to determine where to focus your policing efforts. If the data you have says that more arrests are done in a particular part of the city, then you'll want to put more police there, right? But areas where there are more police will tend to see more arrests. So the fact that you're putting more police in an area where you see more arrests is just going to make the bias more extreme, causing even more arrests there. This causes a feedback loop. So you may be accurately modeling reality, but you're modeling a pre-existing bias and making it worse. And who knows why that pre-existing bias was even there? The fact that there were more arrests there may not be because that area actually has more crime committed, it could be due to other factors, such as racial profiling by police, and in that case your algorithm is now accidentally racist because it's perpetuating racial profiling.
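
A toy model of that loop, as a sketch only: it assumes two districts with identical true crime rates, arrests that scale with patrol presence times crime rate, and a policy that shifts patrols toward whichever district recorded more arrests. The function name, the `shift` parameter, and all numbers are illustrative assumptions, not taken from any real deployment.

```python
def simulate_patrol_feedback(initial_share_a, rounds, shift=0.05, crime_rate=(1.0, 1.0)):
    """Track district A's share of patrols under an 'arrests attract patrols' policy."""
    share_a = initial_share_a
    history = [share_a]
    for _ in range(rounds):
        # Recorded arrests depend on where the police are, not only on crime.
        arrests_a = share_a * crime_rate[0]
        arrests_b = (1 - share_a) * crime_rate[1]
        # Policy: move patrols toward the district with more recorded arrests.
        if arrests_a > arrests_b:
            share_a = min(1.0, share_a + shift)
        elif arrests_b > arrests_a:
            share_a = max(0.0, share_a - shift)
        history.append(share_a)
    return history


# A small initial imbalance (55% vs 45%) with equal true crime rates.
history = simulate_patrol_feedback(0.55, rounds=20)
print([round(share, 2) for share in history])
```

Under these assumptions the small initial imbalance grows until district A receives every patrol, even though both districts have the same crime rate. The choice of policy matters: if patrols were instead reallocated strictly in proportion to arrests, this deterministic model would keep the initial imbalance rather than amplify it.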


dragonwriter 3551 days ago

The problems are really twofold:

(1) Defining the proper goals, and

(2) Measuring the right things (such as the real goals of interest rather than biased proxies.)

With police deployments, you are assuming the solution (rather than letting your algorithm optimize it) by saying "I want to put more police where more arrests occur". What you really want is probably something more like (the exact goal may be different, of course) "I want to deploy police resources where it will most effectively reduce the incidence of crime, weighted by some assigned measure of severity." Then let your ML algorithm crunch the various measurable factors and produce an optimum deployment to do that.

(But, then again with that goal -- and similar problems exist with many likely real goals -- you run into the other problem, which is measuring the incidence of crime -- measuring crime reports may be the obvious approach, but there's plenty of evidence that lots of factors can bias crime reports, including communities having bad experience with police being less likely to report crimes.)


wrsh07 3551 days ago

Thank you. This is so much clearer than what I was saying.

As you say, proper goals and measurement can fix a lot of these problems, and I don't think it's obvious that ML algorithms solve either of those.


yummyfajitas 3551 days ago

I directly addressed this critique two posts up. Why don't you go read that post?

https://news.ycombinator.com/item?id=12627359


eridius 3551 days ago

I did read it, but you're talking about correcting for measurement biases in order to recover an accurate view of reality. But what I'm saying is that accurately measuring reality may in fact be how you get bias, because the very thing you're measuring may be biased. If you're aware the bias exists and have tools that can measure the bias itself then maybe you can correct for the bias, but you can't just expect your algorithm to automatically correct itself in the presence of bias because its goal is to model reality, not to figure out whether there's inherent bias in the thing it's modeling.


srean 3551 days ago

Are you saying that it can form a good estimate of the conditional probability? I can believe that if the sampling process preserves the conditional.

Otherwise one would have to make assumptions about (or in other words, model) the corruption process. The bias compensation machinery then has to be deliberate; it won't happen on its own.

Some sampling processes do not modify the conditional. In those cases no special machinery would be required.
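
A small simulation of the distinction, as a sketch with made-up numbers: X is a binary group, Y is a binary event, and the true P(Y=1 | X) is 0.3 in both groups. One sampling scheme keeps records with a probability that depends only on X; the other also depends on Y. All function names and probabilities here are illustrative assumptions.

```python
import random


def draw_population(n, p_y_given_x, rng):
    """Draw (x, y) pairs with x uniform on {0, 1} and y ~ Bernoulli(p_y_given_x[x])."""
    pairs = []
    for _ in range(n):
        x = rng.randrange(2)
        y = 1 if rng.random() < p_y_given_x[x] else 0
        pairs.append((x, y))
    return pairs


def subsample(pairs, keep_prob, rng):
    """Keep each pair with probability keep_prob(x, y)."""
    return [(x, y) for x, y in pairs if rng.random() < keep_prob(x, y)]


def conditional_rate(pairs, x_value):
    """Empirical P(y = 1 | x = x_value)."""
    ys = [y for x, y in pairs if x == x_value]
    return sum(ys) / len(ys)


rng = random.Random(1)
population = draw_population(200_000, {0: 0.3, 1: 0.3}, rng)

# Selection depends only on x: group 1 is over-sampled, P(y | x) is untouched.
by_x = subsample(population, lambda x, y: 0.9 if x == 1 else 0.3, rng)

# Selection depends on y too: events in group 1 are recorded more often.
by_xy = subsample(population, lambda x, y: 0.9 if (x == 1 and y == 1) else 0.3, rng)

for label, sample in (("depends on x only", by_x), ("depends on x and y", by_xy)):
    print(label, [round(conditional_rate(sample, x), 3) for x in (0, 1)])
```

In the first scheme both conditional rates stay near 0.3. In the second, the expected observed rate for group 1 is 0.27 / (0.27 + 0.21), about 0.56, so a model fit to that sample learns a difference between groups that the sampling created. Undoing it requires knowing or modeling the selection probabilities.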


yummyfajitas 3551 days ago

One approach is to directly model the corruption process. Being the model-based-Bayesian guy I am, this is something I like to do.

But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process. In the example in my linked blog post, test scores might be biased against blacks. But race is also redundantly encoded, so the algorithm has enough information to fix the bias completely by accident.
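
A hedged sketch of that idea with synthetic data: test scores for one group read 10 points low, and a fit that has access to group membership absorbs the offset without anyone modeling it explicitly. This only works because the outcome used for training is assumed to be measured without bias; if the label itself is a biased proxy, nothing here corrects it. `SCORE_PENALTY`, the distributions, and the helper names are all illustrative assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(2)
SCORE_PENALTY = 10.0  # hypothetical: group 1's scores read 10 points low

records = []
for _ in range(20_000):
    group = rng.randrange(2)
    ability = rng.gauss(100.0, 15.0)
    score = ability - (SCORE_PENALTY if group == 1 else 0.0)
    outcome = ability + rng.gauss(0.0, 5.0)  # assumed unbiased outcome measure
    records.append((group, score, outcome))

# Pooled fit: the learner sees only the (biased) score.
pooled_intercept, pooled_slope = fit_line(
    [s for _, s, _ in records], [o for _, _, o in records]
)

# Per-group fit: the simplest model in which group membership is available.
group1 = [(s, o) for g, s, o in records if g == 1]
g1_intercept, g1_slope = fit_line([s for s, _ in group1], [o for _, o in group1])

# Predicted outcome for a group 1 member who scored 90 (true mean outcome: 100).
pooled_prediction = pooled_intercept + pooled_slope * 90.0
grouped_prediction = g1_intercept + g1_slope * 90.0
print(round(pooled_prediction, 1), round(grouped_prediction, 1))
```

With these numbers the pooled fit is expected to predict roughly 95.5 for that person, while the fit that can see group membership lands close to the true mean of 100.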

Fundamentally what I'm saying here is that bias is a statistics problem and has a statistics solution. Insofar as your complaint is algorithms finding the wrong answer, the solution is better stats.

And nothing whatsoever that I've said here would be remotely controversial if the topic were remote sensing.


srean 3551 days ago

> But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process

This is the claim that I am having trouble with.

Say I have two random variables X, Y with some joint distribution. If a corruption process can mess with the samples drawn from it, I cannot see how a model could possibly recover either the joint or the conditional.

Are you saying that the corruption is benign, like missing at random or missing completely at random? Then it's much more believable.
