by lalaland1125 2811 days ago
> The number of women and men in the data set shouldn't matter (algorithms learn that even if there was 1 woman, if she was hired then it will be positive about future woman candidates).

This is incorrect. The key thing to keep in mind is that they are not just predicting who is a good candidate, they are also ranking by the certainty of their prediction.

Lower numbers of female candidates could plausibly lead to lower certainty for the prediction model, as it would have less data on those people. I've never trained a model on resumes, but I definitely often see this "lower certainty on minorities" thing for models I do train.

The lower certainty would in turn lead to lower rankings for women even without any bias in the data.

Now, I'm not saying that Amazon's data isn't biased. I would not be surprised if it were. I'm just saying we should be careful in understanding what is evidence of bias and what is not.


It's wrong even if their model doesn't output a certainty (not all classifiers do). Almost all ML algorithms optimize the expected classification error under the training distribution. So if the training data contains 90% men, it's better to classify those men at 100% accuracy and women at 0% accuracy, than it is to classify both with 89.9% accuracy. Any unsophisticated model will do this.
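
To make the arithmetic above concrete, here is a minimal sketch. The function name and the 90/10 split are just the hypothetical numbers from this comment, not anything known about Amazon's data.

```python
def overall_accuracy(share_men, acc_men, acc_women):
    """Expected accuracy under the training distribution (two groups)."""
    return share_men * acc_men + (1.0 - share_men) * acc_women

# Policy A: perfect on men, always wrong on women.
ignore_women = overall_accuracy(0.9, 1.0, 0.0)

# Policy B: equally good (89.9%) on both groups.
equal_treatment = overall_accuracy(0.9, 0.899, 0.899)

# A model that only minimizes overall error prefers policy A.
print(ignore_women, equal_treatment, ignore_women > equal_treatment)
```

Policy A scores 90% overall against 89.9% for policy B, so an objective that only looks at overall error has no reason to care about the smaller group.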

gp: "The number of women and men in the data set shouldn't matter (algorithms learn that even if there was 1 woman, if she was hired then it will be positive about future woman candidates)."

This is false for typical models.

> The lower certainty would in turn lead to lower rankings for women even without any bias in the data.

This is not true.

Probabilistically speaking, if we are computing P(hiring | gender), lower certainty means there is high variance in the prior over women. However, over a large dataset, the "score" would almost certainly be equal to the mean of the distribution, and be independent of the variance.

In simpler words, if there was a frequency diagram of scores for each gender (most likely bell curves), then only the peak of the bell curve would matter. The flatness / thinness of the curve would be completely irrelevant to the final score. The peak is the mean, and the flatness is the uncertainty. Only the mean matters.

There's not enough information about how their ML algorithm works, nor how large their dataset was for any of the above reasoning to be justified. Fwiw, many ranking functions do indeed take certainty into account, penalizing populations with few data points.
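
As one example of a ranking function that takes certainty into account, here is a sketch of the Wilson score lower bound, which is widely used for ranking by a success rate. Nothing in this thread says Amazon used it; the function name and the counts are made up for illustration.

```python
import math

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a success rate.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# Same observed 80% success rate, very different amounts of data.
small_group = wilson_lower_bound(8, 10)
large_group = wilson_lower_bound(800, 1000)
print(round(small_group, 2), round(large_group, 2))
```

Both groups have the same observed rate, but the group with 10 data points scores about 0.49 while the group with 1000 scores about 0.77. Ranking by this bound penalizes the smaller population without any difference in the mean.
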
If they were using any sort of neural networks approach with stochastic gradient descent, the network would have to spend some "gradient juice" to cut a divot that recognizes and penalizes women's colleges and the like. It wouldn't do this just because there were fewer women in the batches, rather it would just not assign any weight to those factors.

Unless they presented lots of unqualified resumes of people not in tech as part of the training, which seems like something someone might think reasonable. Then the model would (correctly) determine that very few people coming from women's colleges are CS majors, and penalize them. However, I'd still expect a well-built model to learn that if someone was a CS major, any default penalty for attending a particular college should go away.

If the whole thing was hand-engineered, then of course all bets are off. It's hard to deal well with unbalanced classes, and as you mentioned, without knowing what their data looks like we can only speculate on what really happened.

But I will say this: this is not a general failure of ML. These sorts of problems can be avoided if you know what you're doing, unless your data is garbage.

> It wouldn't do this just because there were fewer women in the batches, rather it would just not assign any weight to those factors.

That's exactly the issue we are talking about here. Women's colleges would have less training data, so their weights would get updated less. For many classes of models (such as neural networks with weight decay or common initialization schemes) this would encourage the model to be more "neutral" about women and assign predictions closer to 0.5 for them. This might not affect the overall accuracy for women (as it might not influence whether they land above or below 0.5), but it would cause the predictions for women to be less confident and thus rank lower (closer to the middle of the pack as opposed to the top).
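
A toy sketch of that shrinkage effect, under assumptions of my own: each group is seen only through its own indicator feature (think "women's college" token versus everything else), there is no shared intercept, both groups have the same true 80% positive rate, and the loss is log loss plus an L2 penalty standing in for weight decay. All names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_group_weight(n, pos_rate, weight_decay, n_total, lr=1.0, steps=5000):
    """Gradient descent on one weight that only fires for one group.

    Loss: summed log loss over the group's n examples
          + (weight_decay / 2) * w**2.
    """
    w = 0.0
    for _ in range(steps):
        grad = n * (sigmoid(w) - pos_rate) + weight_decay * w
        w -= lr * grad / n_total
    return w

# Same 80% positive rate in both groups; only the group sizes differ.
p_common = sigmoid(fit_group_weight(900, 0.8, 10.0, 1000))
p_rare = sigmoid(fit_group_weight(100, 0.8, 10.0, 1000))
print(round(p_common, 2), round(p_rare, 2))
```

With these numbers the common group ends up at roughly 0.79 and the rare group at roughly 0.71. Both are still above 0.5, so accuracy is unchanged, but the rare group sits lower in any ranking by score.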

I don't think I'm with you. A neural net cannot do this: picking apart male and female tokens requires a signal in the gradients that forces the two classes apart. If there's no gradient, then something like weight decay will just zero out the weights for the "gender" feature, even if it's there to begin with. Confidence wouldn't enter in, because the feature is irrelevant to the loss function.

A class imbalance doesn't change that: if there's no gradient to follow, then the class in question will be strictly ignored unless you've somehow forced the model to pay attention to it in the architecture (which is possible, but would take some specific effort).
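
Here is a small sketch of that argument, assuming a logistic model with a shared intercept plus one gender indicator, and L2 weight decay on the indicator only. The setup and the numbers are hypothetical, chosen just to isolate the effect.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(n_men, n_women, rate_men, rate_women,
        weight_decay=10.0, lr=1.0, steps=5000):
    """Fit logit = b + w * is_woman by gradient descent on grouped data."""
    n_total = n_men + n_women
    b, w = 0.0, 1.0  # deliberately start with a nonzero gender weight
    for _ in range(steps):
        err_men = n_men * (sigmoid(b) - rate_men)
        err_women = n_women * (sigmoid(b + w) - rate_women)
        grad_b = err_men + err_women
        grad_w = err_women + weight_decay * w
        b -= lr * grad_b / n_total
        w -= lr * grad_w / n_total
    return b, w

# Imbalanced classes, identical outcomes: no gradient separates the groups.
b_same, w_same = fit(900, 100, 0.8, 0.8)

# Same imbalance, but the labels differ between groups.
b_gap, w_gap = fit(900, 100, 0.8, 0.4)
print(round(w_same, 6), round(w_gap, 2))
```

In the first case the gender weight decays to zero despite the 9:1 imbalance. It only becomes clearly negative in the second case, where the labels themselves differ between the groups.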

What I'm suggesting is that it's likely that they did (perhaps accidentally?) let a loss gradient between the classes slip into their data, because they had a whole bunch of female resumes that were from people not in tech. That would explain the difference, whereas at least with NNs, simply having imbalanced classes would not.

Supposing "waiter" and "waitress" are equally qualifying for a job, and most applicants are men, won't the AI score "waiter" as being more valuable than "waitress"?
How did you control for these things? Wondering what patterns there are that people use to prevent social discrimination.

Seems challenging since much of AI, especially classification, is essentially a discrimination algorithm.

There are a few ways you can tackle this issue: 1) have the same algorithm for each group, but train separately (so in the end you have two different sets of weights); 2) over-sample the group that is under-represented in the data; 3) make the penalty more severe for guessing wrongly on female than male applicants during training; 4) apply weights to the gender encoding; 5) use more than just resumes as data.

This isn't an insurmountable problem, but it does require more work than just "encode, throw it in and see what happens".
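
Options 2 and 3 from the list above can be sketched in a few lines. This is only an illustration with made-up helper names: the weighting uses the common inverse-frequency "balanced" formula, n / (k * count), and the over-sampling just repeats minority rows.

```python
from collections import Counter

def balanced_weights(groups):
    """Per-example loss weights so every group carries equal total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

def oversample(rows, groups):
    """Repeat rows of smaller groups until all groups match the largest."""
    counts = Counter(groups)
    target = max(counts.values())
    out = []
    for g in counts:
        members = [r for r, gg in zip(rows, groups) if gg == g]
        out.extend(members[i % len(members)] for i in range(target))
    return out

groups = ["m"] * 9 + ["w"]   # hypothetical 90/10 split
rows = list(range(10))       # stand-ins for resumes

weights = balanced_weights(groups)
resampled = oversample(rows, groups)
print(weights[0], weights[9], len(resampled))
```

With the weights, a mistake on the single "w" example costs nine times as much as a mistake on an "m" example, which is one way to read option 3. Neither trick fixes biased labels; they only stop the smaller group from being drowned out.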

Amazon scrapped only the original team, and formed a new one for which diversity is a goal of the output.

Or: don't include gender in the training data.
They didn’t. It was discovered through other signals (mention of membership in “women’s” clubs, etc.).
So they did. It should be obvious that if you don't want to include gender, then you have to sanitize gender-related data.
That's not as easy as one might think.

Machine learning generally doesn't have any prior opinions about things and will learn any possible correlation in the data.

It could for example discover that certain words or sentence structures used in the resume are more likely associated with bad candidates. Later you find out that <protected class> has a huge amount of people that use these certain words/structures while most other people don't.

And now the AI discriminates against them.

ML will pick up on any possible signal including noise.
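
A quick way to see how a proxy leaks through even when the protected attribute is never a model input: check how much each token shifts the odds of group membership. The toy resumes, token names, and the `proxy_lift` helper below are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: (tokens on the resume, group label).
# The group label is used only for this audit, never as a feature.
resumes = (
    [({"python", "java"}, "m")] * 3
    + [({"python", "golang"}, "m")] * 3
    + [({"python", "womens_chess_club"}, "w")] * 2
)

def proxy_lift(resumes, group):
    """For each token: P(group | token) divided by the base rate P(group)."""
    base_rate = sum(1 for _, g in resumes if g == group) / len(resumes)
    seen, in_group = Counter(), Counter()
    for tokens, g in resumes:
        for token in tokens:
            seen[token] += 1
            if g == group:
                in_group[token] += 1
    return {t: (in_group[t] / seen[t]) / base_rate for t in seen}

lift = proxy_lift(resumes, "w")
print(sorted(lift.items(), key=lambda item: -item[1]))
```

A lift of 1.0 means the token says nothing about the group ("python" here), while a lift far from 1.0 means the token is a proxy. In this toy set "womens_chess_club" has a lift of 4.0, so dropping the gender column alone would not remove the signal.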

More than that, though. Graduates of all-women colleges were also caught. If you're using school as a data point, that's extremely hard to sanitize.
> The lower certainty would in turn lead to lower rankings for women even without any bias in the data.

I don't think that's true. "No bias" means that gender is irrelevant (i.e. its correlation with outcome is 0%). Therefore the system shouldn't even take it into account - it would evaluate both men and women just by other criteria (experience, technical skills, etc), and it would have equal amounts of data for both (because it wouldn't even see them as different).

You need bias to even separate the dataset into distinct categories.

> "No bias" means that gender is irrelevant

False. If we're talking about the technical statistical definition, bias means systematic deviation from the underlying truth in the data -- see this article by Chris Stucchio with some images for clarification:

https://jacobitemag.com/2017/08/29/a-i-bias-doesnt-mean-what...

"In statistics, a “bias” is defined as a statistical predictor which makes errors that all have the same direction. A separate term — “variance” — is used to describe errors without any particular direction.

It’s important to distinguish bias (making errors with a common direction) from variance which is simply inaccuracy with no particular direction."
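
A tiny numeric illustration of that distinction, with made-up numbers: one predictor whose errors all point the same way, and one whose errors have no common direction.

```python
from statistics import mean, pvariance

truth = [10.0, 20.0, 30.0, 40.0]
biased = [12.0, 22.0, 32.0, 42.0]  # every error is +2
noisy = [12.0, 18.0, 32.0, 38.0]   # errors are +2, -2, +2, -2

def error_summary(predictions, truth):
    """Return (mean error, variance of the errors)."""
    errors = [p - t for p, t in zip(predictions, truth)]
    return mean(errors), pvariance(errors)

biased_summary = error_summary(biased, truth)
noisy_summary = error_summary(noisy, truth)
print(biased_summary, noisy_summary)
```

The first predictor has a mean error of 2 and zero variance (pure bias). The second has a mean error of 0 and a variance of 4 (pure variance), even though each individual error is just as large.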

I think the comments I replied to mean bias as in “sexist bias”.
Bias as in racism, sexism, etc, has multiple definitions, some of which are mutually exclusive.
Well, it was clear that _you_ think so.

My point was that you should consider the meaning of the word under which the post you're replying to is correct, especially given that the author was claiming specific domain experience.

The original was:

> The lower certainty would in turn lead to lower rankings for women even without any bias in the data.

your post said:

> If we're talking about the technical statistical definition, bias means systematic deviation from the underlying truth in the data

So I think my interpretation is correct, even though it's not "the technically statistically correct usage". You were referring to the bias of the algorithm (i.e. the mean divergence from the mean in the data), whereas we were referring to the "hiring bias" evident in the data. In fact, your "bias" was mentioned as "lower rankings for women" - i.e. "the algorithm would have (statistical) bias even without (sexist) bias in the data" and I was replying that I think that's false.

Question: So technically, the AI is not biased against women per se, but against a set of characteristics / properties that are more common among women?

I'm not trying to split hairs (or argue), as much as further clarify the difference between (the common definition of) human bias and that of statistical bias.

Correct.

Computers are very bad at actually discriminating against people; they will pick up a possible bias in a statistical dataset (e.g., <protected class> uses a certain sentence structure and is statistically less likely to get or keep the job).

Sometimes computers also pick up on statistical truths that we don't like. For example, you train an ML model to classify how likely someone is to pay back their loan, and it picks up on poor people and bad neighborhoods, disproportionately affecting people of color or low-income households. In theory there is nothing wrong with the data; after all, these are the people who are least likely to pay back a loan. But our moral framework usually classifies this as bad and discriminatory.

Machine Learning (AI) doesn't have moral frameworks and doesn't know what the truth is. The answers it can give us may not be answers we like or want or should have.

On a side note: human bias is usually not that different, since the brain can be simplified as a Bayesian filter. There are predictions about the present based on past experience, reevaluation of past experience based on current experience, and prediction of future experience based on past and current experience. It's a simplification, but most human bias is based on one of these, either explicitly (bad experience with certain classes of people) or implicitly (tribalism).

> the brain can be simplified as a Bayesian filter

I agree with everything else in your post, but just wanted to note that while this is true to some extent, the brain is much less rational than a pure Bayesian inference system; there are a lot of baked-in heuristics designed to short-circuit the collection of data that would be required to make high-quality Bayesian inferences.

This is why excessive stereotyping and tribalism are a fundamental human trait; a pure Bayesian system wouldn't jump to conclusions as quickly as humans do, nor would it refuse to change its mind from those hastily-formed opinions.

> the AI is not biased against women per se

I think I'd make the claim a bit less strongly -- we don't know if there is statistical bias or non-statistical/"gender bias" in the data; both are possible based on what we know.

However exploring the statistical bias possibility, the simple way this could happen is if the data have properties like:

1. For whatever reason, fewer women than men choose to be software engineers
2. For whatever reason, the women that choose to be software engineers are better at it than men

(Note I'm just using hypotheticals here, I'm not making claims about the truth of these, or whether it's gender bias that they are true/false).

Depending on how you've set up your classifier, you could effectively be asking "does this candidate look like software engineers I've already hired"? If so, under the first case, you'd correctly answer "not much". Or you could easily go the other way and "bias" towards women if you fit your model to the top 1% where women are better than men, in our hypothetical dataset.

This would result in "gender bias" in the results, but there's no statistical bias here, since your algorithm is correctly answering the question you asked. It's probably the wrong question though!

Figuring out if/when you're asking the right question is quite difficult, and as the sibling comment rightly pointed out, sometimes (e.g. insurance pricing) the strictly "correct" result (from a business/financial point of view) ends up being considered discriminatory under the moral lens.

This is why we can't just wash our hands of these problems and let a machine do it; until we're comfortable that machines understand our morality, they will do that part wrong.