by gambler 2620 days ago
>The most obvious and immediately concerning place that this issue can be manifested is in human diversity.

I swear, when someone starts building autonomous killer robots, the first set of concerned articles will probably be asking whether robots were properly trained to target all genders and races with equal accuracy. This is not a sensible way to approach AI ethics.

>It was recently reported that Amazon had tried building a machine learning system to screen resumés for recruitment. Since Amazon’s current employee base skews male, the examples of ‘successful hires’ also, mechanistically, skewed male and so, therefore, did this system’s selection of resumés.

There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.

I worked on a similar thing as an "encouraged" side-project at a certain company. Except I realized from day 1 that using AI on resumes is a bad idea and aimed to show this with data. My model was aiming to detect people who would quit or get fired within the first 6 months (with the intent of lowering them in priority for interviews, supposedly). It miraculously achieved 85% accuracy... by figuring out how to detect summer interns.
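
A toy sketch of how that kind of result can arise. Everything here is synthetic and hypothetical (the field names, the 30% intern share, the 10% attrition rate for everyone else); it is not the actual HR data or model. It only shows that a rule keyed on an intern marker can score high accuracy while adding nothing for non-interns:

```python
import random

def make_records(n=1000, intern_share=0.3, seed=0):
    """Synthetic hires (assumed rates): interns always leave within 6 months,
    everyone else leaves early 10% of the time."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        is_intern = rng.random() < intern_share
        left_early = True if is_intern else rng.random() < 0.10
        records.append({"is_intern": is_intern, "left_early": left_early})
    return records

def accuracy(records, predict):
    """Share of records where the prediction matches the 'left_early' label."""
    hits = sum(predict(r) == r["left_early"] for r in records)
    return hits / len(records)

def intern_rule(record):
    """A 'model' that has only learned to detect interns."""
    return record["is_intern"]

def always_stays(record):
    """Trivial baseline: predict that nobody leaves early."""
    return False

records = make_records()
non_interns = [r for r in records if not r["is_intern"]]

overall_accuracy = accuracy(records, intern_rule)
rule_on_non_interns = accuracy(non_interns, intern_rule)
baseline_on_non_interns = accuracy(non_interns, always_stays)

print(f"overall: {overall_accuracy:.2f}")
print(f"non-interns, intern rule: {rule_on_non_interns:.2f}")
print(f"non-interns, trivial baseline: {baseline_on_non_interns:.2f}")
```

On the non-intern subset the rule is identical to the trivial baseline, so the headline accuracy comes entirely from how the label was defined.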

Framing this problem as "bias" and especially hyper-focusing everyone's attention on the diversity aspect of it is extremely irresponsible. (I'm not saying that's what the author is doing, but that's definitely what's being done at large.) Fundamentally, there are significant higher-level problems with using statistical ML models for things like hiring or crime prediction.

That intern story is excellent; I'm adding it to my bank of "weird AI tricks" like pausing Tetris to avoid losing.

More topically, you're quite right to object to that Amazon reference. As far as I can tell, the real story is even worse than mislabeling. Amazon devs wanted a system to spot candidates in resume banks, so they trained it to recognize resumes similar to the ones submitted to Amazon in the past. The entire dataset was 'positive', and the system output degrees of similarity instead of classifications. Amazon applicants are mostly male while the pool was presumably 50/50, so maleness was learned as an element of "Amazon-candidate-ness".

That's also an interesting story, but from the first publication (in Reuters) it's been framed as an uneven base rate 'inevitably/predictably/mechanistically' producing a biased result. Which is not only untrue but downright backwards, since it implies that the rate in the general data is what matters, rather than the relative rate between samples and positive classifications. It's yet another variant of the mammogram base rates question, and I wish people would stop trying to reinforce the incorrect answer to that.
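
A rough sketch of the arithmetic, for anyone who hasn't seen the mammogram question. The 1% / 80% / 9.6% figures are the commonly quoted illustrative numbers for that puzzle, and the 80%/50% gender shares are made up for illustration, not Amazon's real figures. The `lift` helper is my own naming:

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Bayes' rule: P(condition | positive test)."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

def lift(rate_in_positives, rate_in_pool):
    """How much more common a trait is among positives than in the pool.
    A lift of 1.0 means the trait carries no signal."""
    return rate_in_positives / rate_in_pool

# Mammogram puzzle: 1% prevalence, 80% sensitivity, 9.6% false positives.
mammogram = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)

# Hypothetical shares: positives are 80% male in both cases.
skewed_vs_even_pool = lift(rate_in_positives=0.8, rate_in_pool=0.5)
skewed_vs_skewed_pool = lift(rate_in_positives=0.8, rate_in_pool=0.8)

print(f"P(cancer | positive) = {mammogram:.3f}")
print(f"lift vs 50/50 pool   = {skewed_vs_even_pool:.2f}")
print(f"lift vs 80/20 pool   = {skewed_vs_skewed_pool:.2f}")
```

The same 80% share among positives is a learnable signal against a 50/50 pool and no signal at all against an 80/20 pool, which is the sense in which the relative rate matters and the raw base rate alone does not.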

> That intern story is excellent; I'm adding it to my bank of "weird AI tricks" like pausing Tetris to avoid losing.

Post your bank! Let's be like Magnus Carlsen and occasionally ask ourselves, "What would DeepMind do?"

Oh man, good question. I'm always up for swapping these stories. A lot of these came from a paper on weird AI tricks, and a resulting best-of list on a blog collecting these stories.[1][2] Suffice it to say, the people who think the orthogonality thesis is a weird hypothetical aren't keeping up with the state of things.

- The aforementioned Tetris story: an undirected learner set to maximize its score at Tetris learned normal play techniques, but also learned to pause the game immediately before losing so that the score wouldn't "decline" at game over.

- In the same vein as interns quitting, proxy detection of all sorts. Identifying "field with sheep" by finding green fields with grey skies, or letting heuristics like "humans pick up dogs and cats" override correct identifications. (It's a goat until you pick it up, then it's a dog!)

- An agent playing Q*bert found a known bug for infinite lives, then escalated to an unknown bug which disabled the game while overflowing the score counter.

- Agents in a physics sim tasked with jumping as high as possible instead learned to 'fly' by abusing collision detection bugs, hitting themselves in ways that created upward momentum.

- Another "maximize jump height" task demonstrated that "highest" is an extremely fuzzy term. Initially measured by highest point, the agents became incredibly tall. Measured by lowest point, they stayed tall and grew top-heavy to 'kick' their base upwards.

- Number-handling bugs of all kinds. In one case, small twitches led to floating-point errors that created energy. In another, a "minimize force" task got solved by maximizing force and triggering integer wraparound.
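
For the wraparound case, a minimal illustration of the mechanism. This is a hypothetical simulator that stores force in a signed 32-bit integer, not the code from the paper:

```python
def to_int32(value):
    """Wrap a Python int into the signed 32-bit two's complement range."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

def measured_force(true_force):
    """What the (hypothetical) simulator reports after storing the force as int32."""
    return to_int32(true_force)

# Candidate behaviours, from "apply no force" to "apply absurdly large force".
candidates = [0, 10, 1_000, 2**31 - 1, 2**31]

# An optimizer told to minimize the *measured* force picks the overflow.
best = min(candidates, key=measured_force)

print(best, measured_force(best))
```

The largest true force wraps to the most negative reading, so it "wins" the minimization.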

My personal favorite is an adversarial bug. An agent playing tic-tac-toe on an infinite grid with a time limit submitted extremely remote moves which caused timeouts/crashes in any agent that tried to model the full board.

[1] https://arxiv.org/pdf/1803.03453.pdf

[2] https://aiweirdness.com/post/172894792687/when-algorithms-su...

You’ve just read a long article that covers many aspects and zeroed in on your own hobby horse. You say there are significantly bigger issues, but you don’t actually talk about that. Instead you talk about the thing you just said you didn’t think people should be talking about. There’s some serious projection going on here.

>Framing this problem as "bias"

Except that's exactly what it is. Much as your model was biased against interns.

> and especially hyper-focusing everyone's attention on diversity aspect of it is extremely irresponsible.

Why? Pointing out a specific and concrete harm badly designed ML models cause is irresponsible? Just because the same kind of methodological flaw can cause other harms, it's irresponsible to use a motivating example?

>Why? Pointing out a specific and concrete harm badly designed ML models cause is irresponsible?

In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole. Again, I'm not saying this article is guilty, but most are.

> In my opinion, yes, if it leads most readers to misjudge some fundamental properties of the problem as a whole.

Which problem? The general statement of this problem is "models, trained on [somehow] misrepresentative data [or even technically representative data] can draw unintended conclusions that lead to harm". Specifically in this case, the harm was "the model was basically just trained to ignore all women applicants due to bad inference of conditional probabilities".

This is a common thing. Because our society draws lines and has bias, it's fairly common for modelling failures to exist along those lines. Indeed, sometimes the failures are mostly harmless and immediately obvious, but often they aren't. And people building models should be made aware of those failure scenarios, and be especially aware of failure scenarios that affect underrepresented groups, because those groups are the ones the model is most likely to fail on if you aren't actively looking for them.

And this stuff is pervasive. Facial recognition tech is much worse at noticing the faces of darker skinned people [1]. Some of this is because the people building the common models (eigenfaces etc.) didn't use diverse skin tones, but some of it goes back further: white balance in film was tuned for lighter skin tones until the 90s[2]. Some of that has likely persisted into modern film and camera technology, unfortunately. People working with data need to understand their data. And that means understanding how bias infests their data.

> fundamental properties of the problem as a whole

You've yet to state the "whole problem" or the fundamental properties that people might misjudge. So I'm unclear what they are.

[1]: Arguably an advantage now.

[2]: https://petapixel.com/2015/09/19/heres-a-look-at-how-color-f...

>Which problem? The general statement of this problem is "models, trained on [somehow] misrepresentative data [or even technically representative data] can draw unintended conclusions that lead to harm".

Throwing AI at answering an ill-formed question or optimizing a process that shouldn't happen in the first place is not something that can be corrected by getting better training data.

Moreover, automation can have consequences that aren't detectable by analyzing some test set.

> Except that's exactly what it is.

Using the term 'bias' has certain political motivations behind it. It's not so much that the term is technically untrue as that it is non-neutral. For instance, here are some definitions of 'bias' I just grabbed from American Heritage:

"A preference or an inclination, especially one that inhibits impartial judgment."

"An unfair act or policy stemming from prejudice."

"A statistical sampling or testing error caused by systematically favoring some outcomes over others."

The ML model does not have a preference, inclination, or prejudice relating to interns, except insofar as we anthropomorphize it to have them. What does using a word suggesting that add?

A more neutral account of what's going on is along the lines: It's easy to accidentally train ML models so that they will make systematic errors. (Among those errors is the possibility for it to exhibit behavior resembling prejudice.)

Fine: it's easy to accidentally train ML models so that they will make systematic errors. Often these errors stem from systematic biases in our society; model creators should therefore be aware of the potential biases[1] that their models could reflect, and how to prevent them.

[1]: With the political motivation.

> Often these errors stem from systematic biases in our society ...

Depending on what the appropriate quantification of 'often' is, that might make sense. Do we have enough reason to believe it would take on a high enough value to merit the usage of a term that refers only to it?

The other problem with what you're describing is that all we actually know is that the model is reflecting the current state of things. Your statement attributes particular causes to the current state of things, and implies a certain valuation of the current state of things (which I don't personally disagree with, necessarily—but I don't think my personal views should be reflected in scientific/engineering jargon).

So given the uncertain value of 'often,' and the unsettled nature of the causes behind various aspects of the 'current state of things,' it seems to be solidly jumping the gun to frame the entire general problem with a term that refers to this partial and fraught aspect of it.

>Your statement attributes particular causes to the current state of things

I didn't, nor should it matter how we got to where we are for a builder of a thing.

> and implies a certain valuation of the current state of things

This may have happened, but I'd disagree: recognizing that there exists inequality doesn't cast value judgement on that inequality. I simply stated that it's there. Perhaps saying "how to prevent them" is casting value judgement, so I might walk that back: model creators should be aware of the biases and aware of tools and strategies to account for them, if so desired.

Personally I think you're a bad person if, armed with the tools to detect and correct, you decide it's okay to build something that has a systemic error that wrongly disfavors some group. But perhaps that's just me.

> ... recognizing that there exists inequality doesn't cast value judgement on that inequality.

You just asserted your attribution of cause right there: inequality. There are multiple possible causes for differing demographic representations in various roles. This is not a settled issue, even though people on both sides promote competing ideologies to the effect that it is.

(And again, I have intentionally left my own views on the subject out of this, even though I suspect they align with yours (insofar as cause attribution goes): I'm just pointing out the fact that this isn't something society agrees on, nor is it something the scientific data resolves unambiguously.)

> Personally I think you're a bad person if, armed with the tools to detect and correct, you decide it's okay to build something that has a systemic error that wrongly disfavors some group.

Agreed, hinging on that point about cause attribution.

> Often these errors stem from systematic biases in our society

No, this also does not match.

One of the easiest ways to get an ML model that makes systematic errors is a spam filter. If I train on my spam folder with no consideration, what the filter will learn is that any language which isn't my own is spam, and that servers located outside my nation are spammers. This resembles prejudice.

The cause of this systematic error is that individual email addresses do not get ham emails uniformly from every nation and every language. Proximity warps the data. I would need to normalize the data based on language and nation if I wanted to remove those errors in the filter. Looking at it from a political perspective does not make the filter perform better, and fixing it from that side has a high risk of causing even more errors in the model.
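
One possible reading of "normalize", sketched on a made-up mailbox. The counts and the two language codes are hypothetical, and cell-balancing weights are just one of several ways to do this. The idea is to reweight samples so that language alone no longer predicts the label:

```python
from collections import Counter

# Hypothetical mailbox: (language, label) pairs. Foreign-language ham is rare.
mailbox = (
    [("en", "ham")] * 90 + [("en", "spam")] * 10
    + [("ru", "ham")] * 2 + [("ru", "spam")] * 18
)

def spam_rate_by_language(samples, weights=None):
    """Weighted share of spam within each language."""
    weights = weights or [1.0] * len(samples)
    total, spam = Counter(), Counter()
    for (lang, label), w in zip(samples, weights):
        total[lang] += w
        if label == "spam":
            spam[lang] += w
    return {lang: spam[lang] / total[lang] for lang in total}

def balancing_weights(samples):
    """Weight each sample so every (language, label) cell sums to the same total."""
    cell_counts = Counter(samples)
    return [1.0 / cell_counts[s] for s in samples]

raw = spam_rate_by_language(mailbox)
balanced = spam_rate_by_language(mailbox, balancing_weights(mailbox))

print("raw:     ", raw)
print("balanced:", balanced)
```

With the raw counts, "not my language" looks like a strong spam signal. After balancing, each language has the same weighted spam rate, so a filter trained on the weighted data has to rely on content instead. This only works when every (language, label) cell has at least some examples.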

> There is nothing "mechanistic" about this. It depends on how you select sample resumes and how you split them between "good" and "bad" labels.

Isn't that what the article is trying to say, though? That your model can only be as accurate as your data set… and that even then, you have to be very careful to make sure it's not inferring patterns from entirely unrelated information?

If you train an AI using data from a system that already has certain biases, then the AI is going to replicate those same systemic biases in its own predictions. It follows the "garbage in, garbage out" idiom.

Curiously though, did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?

>If you train an AI using data from a system that already has certain biases, then the AI is going to replicate those same systemic biases

That's not what happened in the example at all. The example company isn't biased against summer interns, "who stops working after x time" was just a bad question.

The comment you're replying to can boil down to "do you want a monkey's paw solving your problem? If so then AI may be for you"

Or perhaps "stop pretending you're ever going to get ethics or empathy out of a computer"

I was referring to the Amazon resume model. The problem with the intern hires model was labeling, as GP said.

>did you compare the non-hire (full time) rates of interns vs fire rates of non-interns?

Not sure I understand the question. IIRC, the way the data was set up there was no way to tell why an intern stopped working for the company, because for all interns the "reason code" for separation was the same.

Isn't this one of the major concerns of ML, the bias-variance trade-off? By creating a low-variance model, we create a highly biased model that misses some of the important feature relationships necessary to create a truly generalized model?

Meaning, isn't it prudent to spend time on this issue?
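
For reference, the trade-off being invoked is the standard decomposition of expected squared error. The notation is assumed here, not taken from the article: a target $y = f(x) + \varepsilon$ with noise variance $\sigma^2$ and a learned predictor $\hat{f}$, with expectations taken over training sets and noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma^2
```

"Bias" in this expression is the systematic gap between the average prediction and the true function; nothing in it refers to demographic groups.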

You are conflating bias (error) with bias (fairness).

Haha my comment was originally drafted to accuse the parent comment of the same. As I read the article, it's concerned with error (e.g. misclassification of cancer) but the parent comment translated this to mean the social bias.

I get the point, but why didn't you just exclude intern resumes from the training data? Do you still suspect a skewed result?

>I get the point, but why didn't you just exclude intern resumes from the training data?

That was the logical next step and we started on that, but it required exporting more historic data out of the HR system and filtering out anyone who started as an intern as well. Sounds simple, but in practice it's anything but. Just for reference, data extraction, cleaning and filtering in that project took at least an order of magnitude more time than anything related to machine learning.

The project eventually lost steam and got abandoned.

>Do you still suspect a skewed result?

Absolutely. My personal intuition is that there is very little correlation between resumes and candidate quality. If that is true, any seemingly accurate predictions would be the result of a similar problem. Testing this hypothesis was a large portion of why I agreed to work on the project in the first place.

@gambler: thank you for reading my reporting. I would love to chat confidentially to understand your perspective better. Please see my HN profile.

@sfreporter your reporting had the main lede quite buried to create sensationalism. Gender bias is a distant problem if the model's results are completely random.

Gender bias was not the only issue. Problems with the data that underpinned the models’ judgments meant that unqualified candidates were often recommended for all manner of jobs, the people said. With the technology returning results almost at random, Amazon shut down the project, they said.

Got it. Thanks for the response.