by fvdessen 2609 days ago
> Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés. Amazon spotted this and the system was never put into production.

Couldn't they have retrained the system with a 50/50 mix of male and female resumes? Or restricted the algorithm to sorting male resumes? Or maybe resumes don't actually correlate with success at Amazon at all...

10 comments

One situation I could see leading to this result (Amazon cancelling their resume filtering software with the excuse that it 'skewed male') is that

1. The AI system accurately predicted employee success across both genders

AND

2. The AI system predicted that women would do worse than men

That's politically embarrassing and something that you can't necessarily 'fix' by improving the system. (see: all the 'will this person commit a crime if let out on parole' systems that end up accurately discriminating based on race)

This isn't to say that women are worse engineers than men, or anything of that sort - only that the applicant pool to Amazon was skewed, or women were treated worse in the workplace and thus performed worse, or a dozen other possible causes. (And only in this hypothetical scenario! I have no inside info from Amazon!)

Your example is quite possible, particularly at an organization that would be embarrassed by such a result.

Assume that the ability curves of male and female applicants are identical; that the majority of applicants are male; and that Amazon wants to hire more women than would be expected given the portion of applicants who are female.

A natural way of accomplishing this goal is to give extra points to female applicants [0].

Due to selection bias, the ability curve of women within the population of Amazon engineers would skew lower than that of men within the same population.

This is a special case of a more general phenomenon. If you have a signal S that is positively correlated with a desired trait in the general population, and you over-select for S, you will find that S is negatively correlated with that trait within your selected population.

[0]. All proposals I have seen amount to either a good approximation of this or changing the applicant pool. And, by assumption, the latter is excluded.
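
A rough way to check this argument is a small simulation. The sketch below is only a toy model under the assumptions stated above: every parameter (the 20% female share, the 0.5-point bonus, the 10% hire rate) and the function name `simulate_hiring` are made up for illustration, and nothing here comes from Amazon.

```python
import random
import statistics

def simulate_hiring(n_applicants=100_000, female_share=0.2, bonus=0.5,
                    hire_fraction=0.1, seed=0):
    """Toy model: identical ability distributions, plus a score bonus for one group."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n_applicants):
        is_female = rng.random() < female_share
        ability = rng.gauss(0.0, 1.0)  # same distribution for everyone (assumption)
        score = ability + (bonus if is_female else 0.0)
        applicants.append((score, ability, is_female))

    # Hire the top fraction of applicants by adjusted score.
    applicants.sort(key=lambda a: a[0], reverse=True)
    hired = applicants[: int(n_applicants * hire_fraction)]

    mean_m = statistics.mean(a for _, a, f in hired if not f)
    mean_f = statistics.mean(a for _, a, f in hired if f)
    share_f = sum(1 for _, _, f in hired if f) / len(hired)
    return mean_m, mean_f, share_f

mean_m, mean_f, share_f = simulate_hiring()
print(f"mean ability of hired men:   {mean_m:.2f}")
print(f"mean ability of hired women: {mean_f:.2f}")
print(f"female share among hires:    {share_f:.2f}")
```

With these settings the hired women are over-represented relative to the applicant pool and have a lower average ability than the hired men, even though the two applicant groups were drawn from the same distribution. The gap comes entirely from the lower effective cutoff.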

In this case, it appears to instead be a matter of journalists focusing on totally the wrong aspect of a story for more drama. Buried deep in the original Reuters piece is this offhand mention:

> Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.

Apparently the recommendation system really did create gender bias, neither inherited from real differences nor from replicated human biases. (It looks like an issue with mismatched training data and task.) But that initial bias was found and corrected (2015) more than a year before the project was cancelled (2017) for providing "random" results. I think this is the most extreme case of algorithmic bias I've ever seen, but also the least commonly relevant; Amazon appears to have built a model which contained almost no rules except sexism, and scrapped it for not knowing anything worthwhile.

https://www.reuters.com/article/us-amazon-com-jobs-automatio...

That is certainly another plausible explanation - and a less culture-war infused one, too. Thanks!
This feels like an elephant in the room when it comes to AI bias. We develop an AI that accurately predicts outcomes and discover it is biased, then instead of asking if maybe this means our current system is deeply biased and needs to be changed, we say, "don't use the AI; keep using the people who might or might not be biased but we don't know because we can't measure it in the way an AI can be measured."

If it isn't acceptable to use an AI to create biased outcomes, how is it acceptable to use people to create the same outcomes? AI decision making can be examined and tuned in ways that people cannot.

The problem is that AI and more generally 'algorithms' are or were presented as neutral and unbiased. As such their biased results prop up a biased system.

I don't think people are against using ML and for biased human systems. Just pointing out the ignorant, naive and lazy deference to computers that often occurs in human systems that share the same bias.

In short I'd think most people who are against biased AI are also against biased human systems for very similar reasons.

Of course, sometimes reality is also biased, and the AI systems are just accurately reflecting reality. And that's an even bigger elephant.
I’m not sure what that even means if we know we can bias outcomes. Pretending there is some kind of natural state that is preferred for the sake of being natural seems odd given humans' propensity to change the world to suit. I also suspect that for many, ‘reality’ is really just a dog whistle for their preferred biases. Not to mention the entire issue with deriving an ought from an is.
Suppose you train an AI to predict how good people are at weight lifting, trained from a bunch of seemingly unrelated data (maybe you want to hire bouncers or construction workers). You will find that the model predicts better performance for males. You notice this, identify that men are more likely to go to the gym than women, and modify your data to compensate for this. But when you rerun the model men still show better results. You find some other biases in your data. You find societal biases, like role models for girls not being physically strong. You even take some women and show that with training they outperform average men.

You can modify reality, but our understanding of biology - especially hormones - clearly tells us that the AI was right: men are generally better than women at weight lifting.

I'm not saying that every issue is like that, but it would be foolish to ignore that sometimes reality is biased, sometimes in obvious ways and sometimes more subtly.

One major problem.

The parole software was NOT being fed data for "will this person commit another crime". It was being fed data for, "will this person be a suspect for another crime".

The significant difference is that selective enforcement biases the data that it was trained on. Said selective enforcement has multiple causes, including the fact that heavier patrolling in black neighborhoods makes catching crimes more likely.

The size of the selective enforcement bias shows in a number of ways. For example consider drugs. In surveys, the usage of illegal drugs is the same in blacks and whites. And yet 6 times as many blacks are arrested for using illegal drugs as whites.
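
To make the mechanism concrete, here is a minimal sketch of how identical behaviour can produce very different arrest records. All numbers are hypothetical (a 10% usage rate in both groups, detection probabilities of 1% and 6%), picked only so that the output mirrors the roughly 6x ratio mentioned above; `arrest_rate` is an illustrative helper, not anything from the parole software.

```python
import random

def arrest_rate(n_people, use_rate, detection_rate, rng):
    """Fraction of a group that ends up with an arrest record."""
    arrests = 0
    for _ in range(n_people):
        uses = rng.random() < use_rate
        # An arrest requires both the behaviour and being caught.
        if uses and rng.random() < detection_rate:
            arrests += 1
    return arrests / n_people

rng = random.Random(0)
use_rate = 0.10  # identical true behaviour in both groups (assumption)
lightly_patrolled = arrest_rate(200_000, use_rate, 0.01, rng)
heavily_patrolled = arrest_rate(200_000, use_rate, 0.06, rng)

print(f"arrest rate, lightly patrolled group: {lightly_patrolled:.4f}")
print(f"arrest rate, heavily patrolled group: {heavily_patrolled:.4f}")
print(f"ratio: {heavily_patrolled / lightly_patrolled:.1f}")
```

A model trained on the arrest column would learn a large group difference that does not exist in the underlying behaviour column, because the label it sees is "was caught", not "did it".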

Which represents ground truth better? arrest records, or survey results?
For this? Probably survey results. Particularly https://nsduhweb.rti.org/respweb/homepage.cfm.
I think this retelling of the story is over-simplified. It's a compelling story, but I don't know any competent engineers who give up on a whole project because of one setback. If this system never saw production use, it was because it's still not ready, or there were many other issues that aren't mentioned that led the team to give up, or because political winds shifted. Amazon is famous for killing projects quickly.
It does make you wonder how much AI will be... AI, and how much guidance toward desired outcomes humans will give it.

Humans are pretty happy to create nonsensical results if it fits their goals... especially if it benefits them. I wonder if with AI we do that to the point that it is somewhat irrelevant.

The whole problem with allegations of AI bias is that people often point to disparities of outcome as proof of bias. The reality is that there are plenty of disparities of outcome regardless of bias, and the allegations of bias and the attempts to rectify it are another vector for the introduction of bias.
Sounds like an extraordinarily poor AI system if it depends on absolute numbers, and not per capita. And wouldn't the number of unsuccessful hires also skew male?
"Sounds like an extraordinarily poor AI system if it depends on absolute numbers, and not per capita."

To some extent, you're bringing in your human bias to prefer human biases when you make that statement. We humans have a hierarchy of important attributes, and for various reasons believe race and gender are more important than eye color or height. But the machine learning algorithm just gets a multidimensional point in hyperspace. It doesn't, a priori, "know" that it needs to do a "per capita" adjustment based on FIELD_1 any more than it knows it needs to do a per capita adjustment on FIELD_2. And you can't "adjust" on all the fields because that'll just cancel out.

We are also in the weird position of wanting the machine to do adjustments based on FIELD_1, but without us having to actually admit to ourselves that we're doing it. From a technical perspective, probably the best answer is to do a straight-up training based on the data, then have a cleanly separated after-the-fact cleanup process to perform whatever social adjustments it is we want on the outcome. But nobody is willing to admit that's what we want, and to put those adjustments down on paper in the form of code, because the instant they're concrete, pretty much everybody is going to decide they're wrong, and no two people are going to agree on the manner in which they are wrong, and an epic, national-front-page-news shitstorm will ensue. So here we are, trying to make adjustments without making adjustments, or, alternatively, trying to make adjustments in a place where we can blame the AI rather than humans.

(The ironic thing is that because we can't admit what we're trying to do, we're going to end up doing a really poor job of it. Tools will be applied haphazardly, the results can't be measured except very grossly at the very end of the process, and the goals won't be obtained and the system is always going to be quirky and weird. If we could clearly declare what it is we actually wanted, it would be fairly easy to get it from the AIs.)

The basic "resumes skewed male so the algorithm did too" explanation appears to be incorrect. But it's found in the original Reuters story and most derived stories, and finding it here implies it's reached the level of urban legend.

Going by the details of the Reuters story and several others, it appears that what actually happened was a training/task mismatch. Amazon wanted an algorithm to do resume discovery, which recruiters would run and get quality predictions as they viewed resumes. But they trained it on resume results, giving it past resumes which had been submitted to Amazon and telling it to seek similar resumes. None of the stories make it clear if there even was negative training data; it looks like the tool was simply told to compute degree-of-similarity to past inputs, and possibly told to prioritize resumes which were ultimately hired.

As a result, the tool was trying to convert a relatively gender-neutral pool (resumes found online) to a skewed one (Amazon applicant resumes), and did so by weighting gendered terms. It also seems to have underweighted technical terms, failing to appreciate them as mandatory or strictly position-specific.

The developers were sufficiently aware of that to catch and correct the known gender biases (e.g. devaluing women's colleges or the literal word "women's"), but were scared there were other uncaught biases. And the results were apparently terrible all around, so the tool was scrapped. Which is pretty much what you'd expect from something trained on exclusively positive, sample-biased examples. The story has been seriously distorted, but the real plan also seems terrible...

Consider the possibility that the (pre-AI system) probability of success for a female applicant is the same as the probability of success of a male applicant. You could make a "per capita" quota as a kind of goal. That's not a problem, but how would you make sure the quota was met?

The typical AI system doesn't work on the basis of selecting candidates entirely at random, pro rata, in order to meet a quota. It works on the basis of criteria for success. One thing it might learn (unfortunately) is that most posts at the company are filled by men.

From a machine learning point of view, one can just add the constraint that the probability of being in the "yes" bucket is the same for both male and female candidates. Doing this will give a worse fit than an unconstrained optimization, but it is fairer.

More sophisticated approaches are possible.
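
One simple way to approximate that constraint is to leave the trained model alone and enforce equal "yes" rates afterwards with a separate score cutoff per group. Note this is a post-processing stand-in, not the in-training constraint described above. The scores, group labels, and function names below are all hypothetical.

```python
def parity_thresholds(scores, groups, select_rate):
    """Return a per-group score cutoff giving each group about the same 'yes' rate."""
    thresholds = {}
    for g in set(groups):
        g_scores = sorted((s for s, gg in zip(scores, groups) if gg == g),
                          reverse=True)
        k = max(1, round(len(g_scores) * select_rate))  # how many to accept
        thresholds[g] = g_scores[k - 1]                 # k-th best score in group
    return thresholds

def selection_rates(scores, groups, thresholds):
    """Fraction of each group whose score clears that group's cutoff."""
    rates = {}
    for g in set(groups):
        g_scores = [s for s, gg in zip(scores, groups) if gg == g]
        rates[g] = sum(s >= thresholds[g] for s in g_scores) / len(g_scores)
    return rates

# Hypothetical model scores: 10 candidates in group "m", 5 in group "f".
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20,
          0.75, 0.65, 0.55, 0.45, 0.35]
groups = ["m"] * 10 + ["f"] * 5

thresholds = parity_thresholds(scores, groups, select_rate=0.4)
rates = selection_rates(scores, groups, thresholds)
print(thresholds)
print(rates)
```

In this toy data a single cutoff of 0.80 would accept four candidates from group "m" and none from group "f". The per-group cutoffs accept 40% of each, at the cost of passing over some higher-scored candidates, which is the "worse fit" trade-off.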

There's no "just" to any aspect of this topic. I think what you are talking about is what is sometimes called "classification parity", and there are problems with it, and with everything else we've come up with to combat bias.

https://arxiv.org/abs/1808.00023

Or couldn't they provide data augmentation on the same samples to give the effect of a more diverse (and more populous) training set?

Using the blog's skin cancer example, couldn't the labelled images be augmented by altering the skin tones and adding these new examples to the training set?

It seems to me that some of the anomalous results discussed in the article are actually the result of poor model design or poor data pre-processing choices. We can't just throw anything at any ol' machine learning model and expect it to be magic.

maybe, but this might also just be someone unwilling to commit to the sunk cost fallacy. You can spend time and money fixing it, or you can cut your losses and just stop trying to automate something that probably didn't need full automation to begin with.
This story has been constantly misrepresented, because Reuters absolutely botched their initial report. Amazon was never building a tool to decide which interviewed candidates to hire, they were building a tool for discovering candidates. It was biased, but that gender bias wasn't the proximate reason for scrapping the tool.

As far as I can tell from later stories (e.g. [1], [2]), what Amazon actually did was build a tool to show recruiters 'quality' predictions for all resumes, for instance as they scrolled LinkedIn. But they trained it on resumes submitted to Amazon for various positions, possibly also adding weight to resumes which produced hires.

In which case the problem is painfully obvious; the system effectively had no negative training data, and its positive examples (submitted resumes) didn't actually match the desired output (qualified resumes). It was computing degree of similarity between a gender-neutral-ish pool (resumes posted online) and a gender-skewed pool (resumes submitted to Amazon), and tried to make that conversion with whatever data was available - like devaluing resumes that mentioned women's colleges. (This wasn't just a proxy-variable thing, the model essentially learned to weight on gender.) Amazon's team apparently caught this issue and did the usual things like blinding on those words. But they were scared of uncaught factors; reading between the lines, they were unable to "detrain" biases like neural nets do because their dataset and task didn't match.
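
A toy version of that failure mode, for illustration only: the stories don't say what model Amazon used, so the smoothed log-ratio term scoring below is an assumption, and the "resumes" are invented strings. The point is just that when the only training signal is "looks like the submitted pool", the terms that best separate the two pools get the biggest weights.

```python
import math
from collections import Counter

def term_weights(target_docs, background_docs, smoothing=1.0):
    """Log-ratio of each term's smoothed frequency in target vs. background docs."""
    target = Counter(w for d in target_docs for w in d.split())
    background = Counter(w for d in background_docs for w in d.split())
    vocab = set(target) | set(background)
    t_total = sum(target.values()) + smoothing * len(vocab)
    b_total = sum(background.values()) + smoothing * len(vocab)
    return {
        w: math.log((target[w] + smoothing) / t_total)
           - math.log((background[w] + smoothing) / b_total)
        for w in vocab
    }

# Invented examples: a gender-skewed "submitted" pool and a more neutral "online" pool.
submitted = ["python java men's chess club", "java distributed systems",
             "python men's rugby captain", "java python systems"]
online = ["python java women's chess club", "java distributed systems",
          "python men's rugby captain", "java python women's college"]

weights = term_weights(submitted, online)
for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:12s} {w:+.2f}")
```

In this toy data the skill terms appear about equally often in both pools, so they end up with weights near zero, while "women's" gets the most negative weight of any term. Nothing in the setup asked for that; it is just the cheapest way to tell the pools apart.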

Ultimately, the tool was apparently scrapped because it made selections "almost at random". Which, again, isn't exactly surprising in light of the absolutely bonkers choice of training examples.

[1] https://www.aclu.org/blog/womens-rights/womens-rights-workpl...

[2] https://www.ml.cmu.edu/news/news-archive/2018/october/amazon...

That wouldn't matter if the KPI (worker performance) being predicted, which is inherently biased as well, was distributed differently among the balanced pool of applicants.
Just remove gender/sex as variables for the AI, and maybe names too. Preprocess the resumes to strip them out. That removes most of the gender bias.
AI is really good at inferring information. If gender is a real signal, it would be very difficult to filter the input so that the model is not making its determination based on what could be called inferred gender.
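
A minimal sketch of why dropping the column isn't enough, assuming a single made-up proxy feature that shows up on 80% of one group's resumes and 20% of the other's (both numbers are hypothetical):

```python
import random

rng = random.Random(0)
rows = []
for _ in range(20_000):
    is_female = rng.random() < 0.5
    # Hypothetical proxy: any resume feature that is merely correlated with gender.
    has_proxy = rng.random() < (0.8 if is_female else 0.2)
    rows.append((is_female, has_proxy))

# A "model" that never sees the gender column and simply guesses from the proxy.
accuracy = sum(1 for is_female, has_proxy in rows if is_female == has_proxy) / len(rows)
print(f"gender recovered from proxy alone: {accuracy:.2f}")
```

Even this single weak feature recovers the removed variable far better than the 50% you'd get by guessing, and a real resume has many such features that a model can combine.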