Hacker News
by pcmaffey 3550 days ago
Machine learning does not have less bias than human researchers. It is simply magnified at scale.

And that scale is exactly the state of the internet. There is so much data available to study and understand, that we absolutely need better tools, like machine learning or whatever we want to call it, to help us keep up. Shit's moving faster than our human perception can handle, especially for those who didn't grow up with the internet.

Yes, the data analytic tools we have right now are premature (like fast food to our productized minds), but they will improve rapidly as our taste for quality improves.

But sure demonizing the things you don't like is one step on the path to learning what's truly valuable.

4 comments

Bias is a pattern-generation process. Machine learning is a pattern-recognition process. Any bias on the part of the (human) data collection, or the (human) training program author, gets spit out as a "pattern", because it is one. The problem is that it gives the illusion of a bias in reality.

My go-to example is using machine learning to direct police enforcement, often pitched as a counter to racially biased policing. The example works for any city with a historical problem of racial bias in police work. We give the algorithm all the data we have from the last 60 years of policing this city. Patrol schedules, incident records, arrest records... everything. The computer magically tells us where we should focus our efforts. To the police chief who paid for the system, and especially to the media reporting on it, it looks like a computer is making the decisions without bias. Hooray!

Of course, anyone who's ever worked with machine learning can spot the problem. The data set was generated by racially biased policing. That bias will be reflected in all the records: more arrests for race X, more patrols scheduled through their neighborhoods, more incident reports from those areas. So when the algorithm says "increase patrols in this neighborhood," or "look for people who fit this profile," it is simply synthesizing the patterns from 60 years of racial bias. So the police in LA have a real problem: their "unbiased" computer program is telling them that their criminals look like black people, and they should increase patrols in Compton. So they do, and that only takes the data further from "un-biased" reality. In fact, the police "black box" is only pointing out a history of racially biased policing. We're relabeling it as recommendations for future behavior.

Of course, in reality this bias can be corrected for. I don't know if specific crime-stat software does it, but it's certainly doable. Here's an example where I solve literally the same problem (better measurement => more events detected) in a different scenario. Is it remotely controversial that I can do this for a sensor network?

https://www.chrisstucchio.com/blog/2016/delayed_reactions.ht...

You might also be interested to know that a variety of studies have shown that policing is not particularly biased. Arrest statistics and the like correspond pretty well with NCVS and similar crime victim surveys.

http://slatestarcodex.com/2014/11/25/race-and-justice-much-m...

>Machine learning does not have less bias than human researchers. It is simply magnified at scale.

Scale differences can and often do lead to qualitative differences.

Individual (or aggregate) human researchers are not hooked up in huge services to make inferences and deductions automatically about billions of people.

Besides, those machine learning tools, beyond the huge data sets, are programmed in their general framework by human researchers, and are given weights, constraints, and fine-tuning by them, so they have both kinds of biases.

>But sure demonizing the things you don't like is one step on the path to learning what's truly valuable.

So, kind of like disparaging via a straw-man a speech that offers detailed argumentation?

>Individual (or aggregate) human researchers are not hooked up in huge services to make inferences and deductions automatically about billions of people.

Yes they (we) are. It's the same data set. TV, movies, papers, internet videos et al. is all the same biased, labeled data that is being fed (watched, listened to etc...) to machines. You automatically make inferences and deduce things about people based on labeling and training of your brain. You're constantly fine tuning by getting new weights about things through interactions with others and media.

>Yes they (we) are.

I didn't say researchers and/or individual people are not making such judgements about billions of others (e.g. "the Chinese suck/are great").

I said they are not "hooked up in huge services" to make them automatically for billions of individuals -- like an ML algorithm used by Google or Amazon or some government agency etc would do.

My point is that it's the same thing. Individuals with outsized power and influence can affect billions of people based on their own judgments and implicitly make those judgments automatically for billions.

That's what this question is all about - should machine systems be responsible for the kind of sweeping decisions that humans are making on those populations now? Probably.

This is simply not true. Most algorithms can and will correct for biases in their inputs.

See this (somewhat technical) article where I go into explicit (simulations in numpy) levels of detail:

https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

The best analogy I've come up with for the non-technical is that algorithms are like humans trying to draw inferences on octopus society. Some octopi might have bias against some other octopi, but it's the height of octopusthromorphism to expect a human to reproduce that bias.

This is very optimistic. There are well known and documented cases of ml algorithm bias and its cause [1].

And it's not surprising that data itself contains some biases from the humans creating it. Suppose police are asking machine learning where more crime is committed - there will be a feedback loop. Where are they currently making more arrests? If they spend more time there, the bias will be exaggerated.

The op correctly gauges how we should be cautious. Your post, I'm afraid, is misleading at best.

[1] https://www.google.com/amp/s/www.technologyreview.com/s/6017...

Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias.

The examples in the article you link to are not algorithmic bias at all. They consist of:

1) Humans at Facebook manipulating trending results.

2) Google's keyword algorithm (accurately) reflecting the fact that people with black names are more likely to have arrest records.

Let's distinguish "bias" from "accurately learning things you wish it wouldn't learn" or "accurately learning things you wish weren't true."

None of what I'm saying is remotely controversial. If I told you statistics could detect and correct bias in a mobile phone compass, you'd just think "cool stats bro". Is this article remotely controversial? https://www.chrisstucchio.com/blog/2016/bayesian_calibration...

The specific feedback loop you describe - variable detection probability => variable # of detections - can be directly mitigated. For a non-controversial example drawn from sensor networks (sensors report events with a delayed reaction, the longer you wait the more events you detect), see here: https://www.chrisstucchio.com/blog/2016/delayed_reactions.ht...
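To make "variable detection probability => variable # of detections" concrete, here is a minimal sketch (not the method from the linked post, which is truncated here). It assumes the detection probability of each area is already known, which in practice is the hard part; the area names, probabilities, and event counts are all hypothetical.

```python
import random


def simulate_detections(true_events, detection_prob, rng):
    """Count how many of `true_events` are observed when each one is
    detected independently with probability `detection_prob`."""
    return sum(1 for _ in range(true_events) if rng.random() < detection_prob)


def corrected_estimate(detected, detection_prob):
    """Inverse-probability correction: scale the raw count by 1 / p.
    Assumes `detection_prob` is known, which is an assumption of this sketch."""
    return detected / detection_prob


rng = random.Random(0)
true_events = 10_000  # both areas have the SAME underlying number of events
detection_probs = {"heavily_monitored": 0.8, "lightly_monitored": 0.2}

raw_counts = {
    area: simulate_detections(true_events, p, rng)
    for area, p in detection_probs.items()
}
estimates = {
    area: corrected_estimate(raw_counts[area], p)
    for area, p in detection_probs.items()
}

for area in detection_probs:
    print(f"{area}: raw={raw_counts[area]}, corrected={estimates[area]:.0f}")
```

The raw counts differ by roughly the ratio of the detection probabilities even though the underlying event counts are identical; the corrected estimates are only as good as the assumed probabilities.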

(You can find similar examples all over the place. I just link to the ones I wrote because they spring immediately to mind.)

In a compass, a sensor network, adtech or other quant finance, the idea that machine learning can fix biased inputs is not remotely controversial. The concept that statistics suddenly stops working to fix racism is just silly anthropomorphism.

Aha - I think I see our miscommunication. When you say bias you mean statistical bias.

Yes, machine learning is able to correct for that kind of bias - 538's polls forecast is a good example of that.

But you don't get to redefine racial bias to be something innocuous. Yes, black names are more likely to have arrest records, but that "fact" is super misleading [1].

Finally, you're talking past me. I'm not saying that statistics is broken. I'm saying that we should be especially mindful of the OP's point when they say this:

> So what’s your data being fried in? These algorithms train on large collections that you know nothing about. Sites like Google operate on a scale hundreds of times bigger than anything in the humanities. Any irregularities in that training data end up infused into the classifier.

I think the OP author also has a related post about the kind of bias I'm talking about: http://idlewords.com/talks/sase_panel.htm

[1]: http://www.huffingtonpost.com/kim-farbota/black-crime-rates-...

Without getting into a dispute about the definition of "bias", I'm saying that algorithms can accurately measure reality even if input(x=white, all else equal) != input(x=black, all else equal).

You are saying that algorithms are accurately measuring a reality you wish were different. I don't disagree with this.

The right thing to do is to actually answer unpleasant moral questions like "if blacks are 4x more likely to be dangerous criminals, what should we do about it?" But I guess overloading the word "bias" is a nice substitute for clearly thinking things through.

The problem is you're modeling a biased reality. And accurately modeling a biased reality may in many cases accentuate the bias. Take for example the previously-mentioned case of using an algorithm to determine where to focus your policing efforts. If the data you have says that more arrests are done in a particular part of the city, then you'll want to put more police there, right? But areas where there are more police will tend to see more arrests. So the fact that you're putting more police in an area where you see more arrests is just going to make the bias more extreme, causing even more arrests there. This causes a feedback loop. So you may be accurately modeling reality, but you're modeling a pre-existing bias and making it worse. And who knows why that pre-existing bias was even there? The fact that there were more arrests there may not be because that area actually has more crime committed, it could be due to other factors, such as racial profiling by police, and in that case your algorithm is now accidentally racist because it's perpetuating racial profiling.
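The loop described above can be sketched as a toy, deterministic expected-value model. Everything here is an assumption for illustration: the function name, the `focus` exponent (how aggressively patrols are concentrated on high-arrest districts), and the numbers. It is not taken from any real policing software.

```python
def simulate_patrol_feedback(initial_arrests, true_crime_rate,
                             rounds=20, focus=2.0, patrols=100.0):
    """Toy model of the patrol/arrest feedback loop.

    Each round, patrols are split across districts in proportion to
    recorded_arrests ** focus, and the expected new arrests in a district
    are patrol_share * patrols * true_crime_rate for that district.
    Returns the final recorded arrests and the patrol shares per round.
    """
    recorded = list(initial_arrests)
    shares_over_time = []
    for _ in range(rounds):
        weights = [r ** focus for r in recorded]
        total = sum(weights)
        shares = [w / total for w in weights]
        shares_over_time.append(shares)
        for i, share in enumerate(shares):
            recorded[i] += share * patrols * true_crime_rate[i]
    return recorded, shares_over_time


# Two districts with the SAME true crime rate but a skewed arrest history.
recorded, shares = simulate_patrol_feedback(
    initial_arrests=[60.0, 40.0],
    true_crime_rate=[0.1, 0.1],
)
print(f"district 0 patrol share: {shares[0][0]:.3f} -> {shares[-1][0]:.3f}")

# With purely proportional allocation (focus=1) the skew does not grow,
# but it does not correct itself either.
_, proportional_shares = simulate_patrol_feedback(
    initial_arrests=[60.0, 40.0],
    true_crime_rate=[0.1, 0.1],
    focus=1.0,
)
print(f"proportional share: {proportional_shares[-1][0]:.3f}")
```

In this toy setup the initial skew is amplified whenever allocation is more than proportional, and merely preserved when it is exactly proportional; in neither case does the data drift back toward the equal true rates.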

Are you saying that it can form a good estimate of the conditional probability? I can believe that if the sampling process preserves the conditional.

Otherwise one would have to make assumptions about (or in other words, model) the corruption process. The bias compensation machinery then has to be deliberate; it won't happen on its own.

Some sampling processes do not modify the conditional. In those cases no special machinery would be required.

To correct biased measurements (in a careful way) you need

1. Enough knowledge about the structure of the bias to be able to devise a model for it.

2. Some measurements from which to fit the model, with errors that are uncorrelated with the errors in your original data.

These things are not always easy to obtain, even in relatively mundane settings. It is also a distinctly non-automatic procedure - it requires someone to decide that a bias exists, to model it, obtain the relevant data, and fit the bias correction model, all before they can begin to obtain unbiased (or probably just less-biased) measurements.
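As a rough illustration of the two requirements above, the following sketch assumes the simplest possible bias structure (a constant additive offset) and a small set of independent reference measurements. The offset, noise levels, and sample sizes are hypothetical; real bias structures are rarely this convenient.

```python
import random
from statistics import mean

rng = random.Random(1)
TRUE_OFFSET = 2.5  # hypothetical additive bias, unknown to the analyst

true_values = [rng.uniform(0, 20) for _ in range(500)]

# Original measurements: biased by a constant offset, plus noise.
biased = [t + TRUE_OFFSET + rng.gauss(0, 0.5) for t in true_values]

# Requirement 2: independent reference measurements for a subset of points,
# with errors uncorrelated with the errors in `biased`.
reference = [t + rng.gauss(0, 0.5) for t in true_values[:50]]

# Requirement 1: a model of the bias. Here: "measurement = truth + offset".
fitted_offset = mean(b - r for b, r in zip(biased[:50], reference))

corrected = [b - fitted_offset for b in biased]

mean_error_before = mean(b - t for b, t in zip(biased, true_values))
mean_error_after = mean(c - t for c, t in zip(corrected, true_values))
print(f"fitted offset: {fitted_offset:.2f}")
print(f"mean error before: {mean_error_before:.2f}, after: {mean_error_after:.2f}")
```

Note that every step required a human decision: that a bias exists, that it is additive, and that the reference data can be trusted. If the offset model is wrong, the correction is wrong too.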

I'm not making the claim that an algorithm magically fixes everything. I'm claiming that sometimes it does, which makes bias less likely to be present in the ML model.

You don't need a human data scientist to decide bias exists, model it and fix it at all. If you read the post I linked to, you can observe a synthetic example of linear regression (with redundant encodings) accidentally fixing bias.

So yes, if your model is expressive enough and you have sufficient data, it will automatically fix bias. Is it really shocking that an algorithm which is good at finding hidden patterns will find a hidden pattern?

I don't really understand that claim. You are explicitly adding a bias that is linearly dependent on your race variable, and then allowing your regression to recover that bias by introducing noisy measurements of race (which you as the modeller knew was the thing causing the bias). As you say, that is unsurprising.

That result does not, however, address my point, which is that if the structure of the bias is difficult to understand, or perhaps even just difficult to model, and if relevant measurements (with errors that are uncorrelated with your original errors) are unavailable, then bias correction is essentially impossible.

The point is that the bias is linear, and my model is linear, so the model fixes things. The example is synthetic (so we could know what the right answer is and check if we recover it) so of course I put everything in.

In the linked article, I explicitly reference a real world case where the same linear model was used to discover that grades and test scores are biased in favor of blacks: http://ftp.iza.org/dp8733.pdf

In more complicated situations, the bias would need to be amenable to detection by a neural network, an SVM or random forest. The entire purpose of models like this is that lots of hidden patterns are detected.

Even if relevant measurements are unavailable, one can use redundant encoding to fix bias. Delip Rao explains redundant encoding here, for example, though he is more concerned that ML models might learn facts he wants to remain hidden: http://deliprao.com/archives/129

To remain with the example in your blog post, your model fixed things because the implicit bias model was correct (linear dependence on race), and the data were available, either directly (via the race variable) in the "What if measurements are biased?" section, or indirectly (via the noisy redundantly-encoded race variables) in the "What if we scrub race, but redundantly encode it?" section.

In the first of those two sections you yourself note how bias correction is not possible without the relevant data: "If we scrubbed the data this result would be impossible. Running least squares on scrubbed data yields alpha = [ 0.29878373, 0.30869833] - we can't correct for bias because we don't know the variable being biased on."
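A small synthetic sketch of that contrast, written from scratch rather than taken from the linked post (which used numpy): a measured score is shifted by a hypothetical constant `BIAS` for one group, while the outcome depends only on true ability. Least squares is fitted once with the group variable and once with it scrubbed. It does not cover the redundant-encoding case, and all names and numbers are assumptions.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


rng = random.Random(42)
BIAS = -1.0  # hypothetical: group 1's measured score is shifted down by 1
samples = []
for _ in range(2000):
    group = rng.randint(0, 1)
    ability = rng.gauss(0, 1)
    measured = ability + BIAS * group      # biased measurement
    outcome = ability + rng.gauss(0, 0.1)  # depends only on true ability
    samples.append((group, measured, outcome))


def features(group, measured, use_group):
    return [1.0, measured, float(group)] if use_group else [1.0, measured]


def mean_residual_by_group(weights, use_group):
    """Average (outcome - prediction) within each group."""
    sums, counts = [0.0, 0.0], [0, 0]
    for g, m, o in samples:
        pred = sum(w * f for w, f in zip(weights, features(g, m, use_group)))
        sums[g] += o - pred
        counts[g] += 1
    return [s / c for s, c in zip(sums, counts)]


outcomes = [o for _, _, o in samples]
w_full = least_squares([features(g, m, True) for g, m, _ in samples], outcomes)
w_scrubbed = least_squares([features(g, m, False) for g, m, _ in samples], outcomes)

resid_full = mean_residual_by_group(w_full, True)
resid_scrubbed = mean_residual_by_group(w_scrubbed, False)
print("group coefficient (should be close to -BIAS):", round(w_full[2], 3))
print("mean residual by group, with group:", [round(r, 3) for r in resid_full])
print("mean residual by group, scrubbed:  ", [round(r, 3) for r in resid_scrubbed])
```

With the group variable present, the fitted group coefficient absorbs the offset and the per-group residuals vanish. With it scrubbed, the model systematically under-predicts the group whose scores were shifted down, because the bias model was right but the data needed to fit it were removed.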

I'm not disputing that bias correction is possible, only that it can be much harder than you seem to be implying, with statements like "Most algorithms can and will correct for biases in their inputs.", and "Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias."

I have some experience with bias correction in (ocean) weather forecasting, and in that domain there were problems both with the difficulty of modelling the bias structure, and with obtaining measurements reliable enough for bias correction.

> Machine learning does not have less bias than human researchers.

You are right that machine learning gains bias from the humans that created it, but unless they managed to transfer 100% of their biases to it, it will always have less bias.

We impart our bias on ML algorithms by choosing what data to use to train the AI on.

The problem, I think, is one of self-selection.

Consider two hypothetical social networking websites - Friendface and FaceSpace. Friendface's userbase are mostly white users, while FaceSpace catered mostly to urban, black populations. And it would make sense too - you would only join a social network if your friends are on it. If you're white, chances are the majority of your friends are also white. And vice versa.

So Friendface is a lot more active on their ML front. The problem is when Friendface releases their data - because they're more active on the ML front, and ML scientists love to not have to collect their data, what happens is more and more models are trained on the Friendface data and more and more models are being optimized based on Friendface data. Apparent "structural" racism happens. Tumblrinas all pounce on it as if it were the biggest oppressive struggles of their lives.

A cute thing to imagine in this scenario: FaceSpace suddenly gets good at NLP and open-sources its statistical language model. Recall that FaceSpace users are more likely to use AAVE in their communication, so what do you think the statistical language model would look like?

In the original article, Maciej mentions "going to the community" - using crowd wisdom to handle these sorts of things, and preferring open standards to silo'd ones (like writing your blog post on facebook... why??!!). While that sounds like a good idea, as I've mentioned in my other comment, it also sounds tiring as hell.

Firms act rationally (more or less)... ML is driven by huge companies with huge datasets. Why would they need to prune external datasets when they could just do their ML research with a few SQL queries?