by was_boring 2811 days ago
There are a few ways you can tackle this issue: 1) use the same algorithm for each group, but train separately (so in the end you have two different sets of weights); 2) over-sample the group that is under-represented in the data; 3) make the penalty more severe for guessing wrongly on female than on male applicants during training; 4) apply weights to the gender encoding; 5) use more than just resumes as data.
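For what it's worth, options 2 and 3 can be sketched in a few lines. This is only a rough illustration on made-up data, not what Amazon did: the `resumes` list, the `group` field, and both helper functions are hypothetical, and inverse-frequency weighting is just one common way to make errors on the smaller group cost more.

```python
import random
from collections import Counter, defaultdict


def oversample(rows, group_of, seed=0):
    """Duplicate randomly chosen rows of smaller groups until every
    group is as large as the largest one (option 2)."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[group_of(row)].append(row)
    target = max(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        balanced.extend(members)
        # Top up the group with random duplicates of its own rows.
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


def inverse_frequency_weights(rows, group_of):
    """Per-row loss weights chosen so that each group contributes the
    same total weight during training (one way to do option 3)."""
    counts = Counter(group_of(row) for row in rows)
    n, k = len(rows), len(counts)
    return [n / (k * counts[group_of(row)]) for row in rows]


# Hypothetical toy data: 8 resumes from group "m", 2 from group "f".
resumes = [{"id": i, "group": "m"} for i in range(8)]
resumes += [{"id": i, "group": "f"} for i in range(8, 10)]


def group_of(row):
    return row["group"]


balanced = oversample(resumes, group_of)
weights = inverse_frequency_weights(resumes, group_of)

print(Counter(group_of(r) for r in balanced))
print(weights[0], weights[-1])
```

You would use one or the other, not both: either train on `balanced`, or train on the original rows and pass `weights` to whatever per-sample weight hook your training loop has (assuming it has one).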

This isn't an insurmountable problem, but it does require more work than just "encode, throw it in and see what happens".

Amazon only scrapped the original team; it then formed a new one in which diversity is a goal for the output.


Or: don't include gender in the training data.
They didn’t. It was discovered through other signals (mention of membership in “women’s” clubs, etc.).
So they did. It should be obvious that if you don't want to include gender, then you have to sanitize gender-related data.
That's not as easy as one might think.

Machine learning generally doesn't have any prior opinions about things and will learn any possible correlation in the data.

It could, for example, discover that certain words or sentence structures used in the resume are more likely to be associated with bad candidates. Later you find out that a large share of people in <protected class> use these words/structures while most other people don't.

And now the AI discriminates against them.

ML will pick up on any possible signal including noise.
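A toy example of that proxy effect, with invented numbers: the `history` rows, the `mentions_womens_club` feature, and the one-feature "model" below are all hypothetical, and a real resume model would be far more complex. Gender is never given to the model; it is kept only so the result can be audited afterwards.

```python
from collections import defaultdict

# Hypothetical hiring history: (gender, mentions_womens_club, hired).
# The past decisions are biased: women were hired less often.
history = (
    [("m", False, True)] * 60 + [("m", False, False)] * 40
    + [("f", True, True)] * 10 + [("f", True, False)] * 30
    + [("f", False, True)] * 6 + [("f", False, False)] * 4
)


def fit(rows):
    """'Train' a one-feature model: the historical hire rate for each
    value of the club feature. The gender column is ignored."""
    hired, total = defaultdict(int), defaultdict(int)
    for _gender, club, was_hired in rows:
        total[club] += 1
        hired[club] += was_hired
    return {club: hired[club] / total[club] for club in total}


model = fit(history)


def mean_score(gender):
    """Audit step: average model score for one gender."""
    scores = [model[club] for g, club, _ in history if g == gender]
    return sum(scores) / len(scores)


print(model)
print(mean_score("m"), mean_score("f"))
```

On this data the club feature alone gets a score of 0.25 versus 0.6 without it, so the average score for women ends up well below the average for men even though the gender column was dropped.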

More than that, though. Graduates of all-women colleges were also caught. If you're using school as a data point, that's extremely hard to sanitize.
Then what is the purpose of this? At some point you want this thing to "discriminate" (or "select", if this is a better word) people based on what they have done in life. Which is not negative per se.
But you don't want it to select based on gender.