by tlb 3570 days ago
Another recent paper on this topic: http://arxiv.org/pdf/1606.08813v3.pdf. It shows how naive lending algorithms can skew against minority groups simply because there is less data available about them, even if their expected repayment rate is the same.

It can be self-reinforcing. Imagine some new demographic group of customers appears, and without any data you make some loans to them. The actual repayment rate will be low, not because that group has a worse distribution than other groups, but simply because you couldn't identify the lowest-risk members. A simplistic ML model would conclude that the new group is more risky.

Of course, smart lenders understand that in order to develop a new customer demographic they need to experiment by lending, with the expectation that their first loans will have high losses, but that in the long run learning about how to identify the low-risk people from that demographic is worthwhile. And they correct for the fact that the first cohort was accepted blind when estimating overall risk for the group.
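To make the selection effect concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not taken from the paper: both groups draw their hidden repayment probability from the same uniform range, the lender can screen the established group perfectly, and the new group is accepted blind.

```python
import random

def simulate_repayment(n_applicants, screened, seed=0):
    """Observed repayment rate among approved loans.

    Each applicant has a hidden repayment probability drawn uniformly
    from [0.5, 1.0] (an assumed distribution, identical for both groups).
    screened=True:  the lender can see it and approves only p >= 0.75.
    screened=False: the lender approves everyone (lending blind).
    """
    rng = random.Random(seed)
    approved = 0
    repaid = 0
    for _ in range(n_applicants):
        p = rng.uniform(0.5, 1.0)
        if screened and p < 0.75:
            continue  # rejected, so no outcome is ever observed
        approved += 1
        if rng.random() < p:
            repaid += 1
    return repaid / approved

established = simulate_repayment(100_000, screened=True, seed=1)
new_group = simulate_repayment(100_000, screened=False, seed=2)
print(f"established group (screened): {established:.3f}")
print(f"new group (blind):            {new_group:.3f}")
```

In expectation the screened group repays at about 0.875 and the blind cohort at about 0.75, even though the underlying populations are identical. A model that compares the two observed rates without accounting for how each cohort was selected would read that gap as group risk.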

4 comments

Of course, this theory of discrimination is only applicable when minorities are fundamentally different from majorities. I.e., if the same ruleset is accurate for both whites and blacks (i.e., "I don't care about race, if he puts 20% down he's good"), this argument doesn't work at all - you can train your model on everyone and it'll work just fine.

However, if blacks and whites need to be treated fundamentally differently in order to make accurate loan decisions, then this argument applies. I.e., perhaps whites need a 20% downpayment for a loan to be financially a good risk but blacks need 40% (or vice versa).

I wonder how many people calling algorithms racist will endorse this conclusion. It sounds kind of...racist.

(Note that I don't use "racist" as a synonym for "factually incorrect" or "we should not consider this idea", but merely "this sounds like the kind of thing a white nationalist might say, or Trump would be criticized for if he said".)

Yes, blacks are fundamentally different from whites in terms of the available data to train algorithms on:

http://www.nytimes.com/2015/10/31/nyregion/hudson-city-bank-...

> The government’s analysis of the bank’s lending data shows that Hudson’s competitors generated nearly three times as many home loan applications from predominantly black and Hispanic communities as Hudson did in a region that includes New York City, Westchester County and North Jersey, and more than 10 times as many home loan applications from black and Hispanic communities in the market that includes Camden, N.J.

That's, of course, just recent history. Redlining that occurred from the 1960s on would be enough to adversely affect the housing history data of minority groups even today. Treating everyone equally in the eyes of the algorithm is certainly an easy route to take, but as the non-algorithm expert MLK Jr. pointed out:

> Whenever the issue of compensatory treatment for the Negro is raised, some of our friends recoil in horror. The Negro should be granted equality, they agree; but he should ask nothing more. On the surface, this appears reasonable, but it is not realistic.

Did you read what I wrote? Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

If repayment probability for blacks and whites alike is A x downpayment_fraction + B x credit_score, you can use training data from whites and the model will accurately predict black repayment probability. It only fails if you actually need A' and B' for blacks.

As an example, maybe for whites A = 1.0 and for blacks A' = 0.75. In that case the optimal decision is to demand higher lending standards for blacks - a black person with a 40% downpayment would be treated the same as a white person with a 30% downpayment. Is this your belief?
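A rough sketch of the generalization claim, on simulated data only. The assumptions are mine: B = 0.5, a credit score rescaled to [0, 1], Gaussian noise, and features distributed identically in both groups. The model is fit on one group and then scored on a second group, once with shared coefficients and once with A' = 0.75.

```python
import random

def make_group(n, a, b, rng, noise=0.05):
    """Simulate (downpayment_fraction, credit_score, outcome) rows for one group."""
    rows = []
    for _ in range(n):
        down = rng.uniform(0.0, 0.5)   # downpayment as a fraction of the price
        score = rng.uniform(0.0, 1.0)  # credit score rescaled to [0, 1] (assumption)
        outcome = a * down + b * score + rng.gauss(0.0, noise)
        rows.append((down, score, outcome))
    return rows

def fit_two_coefficients(rows):
    """Least-squares fit of outcome ~ a*down + b*score, with no intercept."""
    s11 = sum(d * d for d, s, y in rows)
    s12 = sum(d * s for d, s, y in rows)
    s22 = sum(s * s for d, s, y in rows)
    s1y = sum(d * y for d, s, y in rows)
    s2y = sum(s * y for d, s, y in rows)
    det = s11 * s22 - s12 * s12
    # Closed-form solution of the 2x2 normal equations.
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def mean_error(rows, a, b):
    """Average of (actual - predicted); near zero means no systematic bias."""
    return sum(y - (a * d + b * s) for d, s, y in rows) / len(rows)

rng = random.Random(0)
a_hat, b_hat = fit_two_coefficients(make_group(20_000, 1.0, 0.5, rng))

same_model = make_group(2_000, 1.0, 0.5, rng)    # second group shares A and B
other_model = make_group(2_000, 0.75, 0.5, rng)  # second group needs A' = 0.75

error_same = mean_error(same_model, a_hat, b_hat)
error_other = mean_error(other_model, a_hat, b_hat)
print(f"shared coefficients: mean error {error_same:+.4f}")
print(f"A' = 0.75:           mean error {error_other:+.4f}")
```

With shared coefficients the mean error stays near zero regardless of how few rows the second group contributes. With A' = 0.75 the model overpredicts for that group. The sketch says nothing about which case holds for real lending data.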

Even in models where race doesn't directly cause an outcome, a model's judgements may be biased against a race.

For example, suppose that (1) people can be green or blue, (2) green people tend to live in Idaho, (3) living in Idaho is associated with people not paying back loans.

A linear model where there are only non-zero, positive coefficients for the path p(green) -> p(Idaho) -> p(fail_to_repay), and p(credit_score) -> p(fail_to_repay) will create trouble, even though color does not directly affect repayment. If you use a multiple regression with fail_to_repay ~ B0 + B1 x Idaho + B2 x credit_score, it will discriminate against green people, by penalizing people from Idaho.
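A minimal simulation of the green/blue example, with made-up numbers: 80% of green people and 20% of blue people live in Idaho, and failure rates are 30% in Idaho and 10% elsewhere regardless of color. The credit_score term is dropped for brevity. With a single binary predictor, the least-squares coefficient B1 is just the difference in failure rates between Idahoans and everyone else, so no regression library is needed.

```python
import random

GREEN, IDAHO, FAILED = 0, 1, 2  # positions within each person tuple

rng = random.Random(0)
people = []
for _ in range(50_000):
    green = rng.random() < 0.5
    idaho = rng.random() < (0.8 if green else 0.2)
    # Color never enters here: failure depends only on Idaho residency.
    failed = rng.random() < (0.3 if idaho else 0.1)
    people.append((green, idaho, failed))

def share(rows, index):
    """Fraction of rows whose field at `index` is True."""
    return sum(row[index] for row in rows) / len(rows)

idahoans = [p for p in people if p[IDAHO]]
others = [p for p in people if not p[IDAHO]]

# B1 in fail_to_repay ~ B0 + B1 x Idaho (difference in group means).
b1 = share(idahoans, FAILED) - share(others, FAILED)

# Average Idaho penalty the model assigns to each color.
green_penalty = b1 * share([p for p in people if p[GREEN]], IDAHO)
blue_penalty = b1 * share([p for p in people if not p[GREEN]], IDAHO)

# Within Idaho, the two colors fail at the same rate by construction.
green_idaho_fail = share([p for p in idahoans if p[GREEN]], FAILED)
blue_idaho_fail = share([p for p in idahoans if not p[GREEN]], FAILED)

print(f"average penalty, green: {green_penalty:.3f}  blue: {blue_penalty:.3f}")
print(f"failure rate in Idaho, green: {green_idaho_fail:.3f}  blue: {blue_idaho_fail:.3f}")
```

The model never sees color, yet the average penalty for green people comes out around four times that for blue people, purely through where they live.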

AFAIK, one of the points of the paper linked in the parent comment is that blindly using indicators like IP address may indirectly lead to discrimination against a racial group in this way, e.g. p(racial_group) -> p(a_specific_IP_address).

Maybe more relevant to your example, though, is that assuming whites and blacks have the same model in the "ground-truth" scenario I presented could cause a model to be discriminative (when it shouldn't be, because the coefficient for the path from p(green) -> p(fail_to_repay) is 0).

This specific issue is hairy, and exists for traditional approaches also.

If I understand your model right, you are saying that Idahoans don't repay loans and your model accurately reflects this. This isn't a bias at all. The model is issuing fewer loans to green people not because they are green but because they live in Idaho and are unlikely to pay back said loans.

This is a case like what is described in the article - when a perfect predictor (another word for this is "reality" or "hindsight") will still exhibit disparate impact.

It is a bias if you calculate the cost to people taking out loans, based on color. Green people will pay a higher cost, even though in the ground-truth model their race is not directly related to loan repayment.

For example, if only blue people in Idaho fail to repay loans, green people will still absorb a greater cost in the multiple regression case above (in the sense that they are more likely to be penalized for being Idahoans).

OK, I guess I'm supposed to agree with you if I beg the question that "available data on blacks specifically is completely irrelevant"...? I do think that the distribution of data specific to blacks is relevant.
Ok, so now we have all acknowledged that we are "race realists" or "scientific racists" in this conversation. ( https://en.wikipedia.org/wiki/Scientific_racism )

Anyway, we've now accepted that blacks and whites may behave differently. For example, let's suppose we have all the training data we need to accurately recognize that one race doesn't pay back their loans as much as others, all else held equal.

What should we do about it? Concretely, how many bad loans should we issue in the name of "fairness"? How large a subsidy must the responsible races pay to the deadbeat ones?

I don't know that either I or Dr. King Jr. has to subscribe to scientific racism just because we subscribe to the reality that folks of different racial backgrounds have a higher probability of having been shortchanged historically. And thus, that any machine learning approach that doesn't factor this in will risk perpetuating such disadvantages, which kind of defeats the ostensible purpose for using machine learning to apply public policy in the first place.
I don't know what you mean by "fundamentally different" but there are definitely going to be demographic differences that the algorithm could use to predict race with good probability from hidden variables. (Where they live, for example.) History has an influence that's hard to remove from the dataset.

I'd guess that another reason this problem is hard is that it's about defining the goal correctly. It's not just maximizing repayment. There is some fairness goal that isn't well-defined.

By "fundamentally different", I mean that the most accurate model will be something like this:

    repayment_probability = 1 x downpayment_frac + 0.5 x credit_score + A x isBlack
for some A != 0. I.e., if A = -0.2, then a black borrower with a 60% downpayment is as likely to pay back a loan as a white borrower with a 40% downpayment.
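A quick check of that arithmetic, using the toy formula exactly as written. The credit score value is arbitrary and assumed to be on a 0 to 1 scale; any value works as long as both borrowers share it.

```python
import math

def repayment_probability(downpayment_frac, credit_score, is_black, a=-0.2):
    """The toy linear formula from the comment; not a calibrated model."""
    return 1.0 * downpayment_frac + 0.5 * credit_score + a * is_black

shared_score = 0.4  # arbitrary, identical for both borrowers
black_borrower = repayment_probability(0.60, shared_score, is_black=1)
white_borrower = repayment_probability(0.40, shared_score, is_black=0)

# The extra 0.20 of downpayment exactly offsets A = -0.2.
print(math.isclose(black_borrower, white_borrower))
```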

If A = 0, then the bias described by tlb and danso won't occur.

What you describe with hidden variables is called "redundant encoding", and it's just a way of recovering the `A x isBlack` term if you remove `isBlack` from your input set. But if blacks and whites repay their loans at the same rate (holding all else equal), redundant encoding won't happen - it doesn't actually improve accuracy.
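A small simulated sketch of redundant encoding, under assumptions that are mine and not from the blog post: a hypothetical binary proxy (say, a neighborhood flag) that matches group membership 90% of the time, features that are otherwise independent of group, and the toy coefficients from above. `isBlack` is left out of the inputs, and the fit reports the coefficient that lands on the proxy.

```python
import random

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(features, targets):
    """Least-squares coefficients from the normal equations (X^T X) w = X^T y."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * y for f, y in zip(features, targets)) for i in range(k)]
    return solve(xtx, xty)

def proxy_coefficient(a, n=20_000, seed=0):
    """Fit without isBlack and return the weight the model puts on the proxy."""
    rng = random.Random(seed)
    features, targets = [], []
    for _ in range(n):
        is_black = rng.random() < 0.3
        # Hypothetical proxy that agrees with group membership 90% of the time.
        proxy = is_black if rng.random() < 0.9 else not is_black
        down = rng.uniform(0.0, 0.5)
        score = rng.uniform(0.0, 1.0)
        outcome = 1.0 * down + 0.5 * score + a * is_black + rng.gauss(0.0, 0.05)
        # Columns: intercept, downpayment, credit score, proxy. No isBlack column.
        features.append([1.0, down, score, float(proxy)])
        targets.append(outcome)
    return ols(features, targets)[3]

coef_when_a_zero = proxy_coefficient(a=0.0)
coef_when_a_negative = proxy_coefficient(a=-0.2)
print(f"A = 0:    proxy coefficient {coef_when_a_zero:+.4f}")
print(f"A = -0.2: proxy coefficient {coef_when_a_negative:+.4f}")
```

In this setup the proxy picks up most of the missing `A x isBlack` term when A = -0.2 and gets a coefficient near zero when A = 0. That matches the claim only under these assumptions; if the other features were themselves distributed differently across groups, the picture would be less clean.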

I describe this in more detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

I agree with you that the core issue is an unspecified true goal. Folks are unwilling to publicly and explicitly state how many bad loans should be issued for fairness or how many unqualified students should be allowed into college for diversity.

Or, for an example closer to home, how much should we lower the bar in order to hire more non-Asian minorities in tech? Daring to ask that question gets you some pretty hostile responses.

Repayment rates are not just individual - they also depend on the financial strength of friends and family who can help you out if you get in trouble. So, I think we have to assume that there are performance-relevant differences that an algorithm will detect.

Also, unless the dataset has information about families, this isn't based on your actual family. It's based on the average benefit people like you get from their family.

The lives of members of social groups can differ for historical reasons. Because of this, the best selection of features (as in, how we select and encode relevant aspects of a dataset for a particular problem), as well as the correlations between them, may differ between groups. The question is not whether there is fundamentally a difference between, in this case, racial groups vis-a-vis paying back loans (i.e. that the only feature required in your model would be "is member of this group"), but which life circumstances have, for whatever reason, left a quantifiable trace in our databases, and how those traces are distributed within that group.

One hypothetical example: suppose that there existed a group G that was not able to go to the top n% of universities due to discrimination. Your company uses some rank of university attended as one of the features input to its favorite machine learning algorithm. However, the dataset you trained on excluded group G. Within this group, the best university individuals have been able to attend is X, which is by definition not in the top n%. Had the algorithm been trained on this group, it would have observed that school X is highly correlated with success in this group, even if not in the original training set. As is, your ML system assigns a low probability of success to members of group G.

Issues like this will be hard to prevent. While that doesn't mean we shouldn't work hard to make real innovations in ML, I think the legal approach of a "right to explanation", as analyzed in http://arxiv.org/pdf/1606.08813v3.pdf and recently added to European law, is nonetheless a helpful tool to ensure accountability.

Yes, if your training data excludes relevant features then you can't use them. No one disputes this.

However, once you start including such people in your training data, these issues are not hard to prevent. In fact, ML systems will often do this accidentally even when you don't want them to (when the sign of the bias has the politically incorrect direction). It's called redundant encoding.

See the section of my blog post "What if we scrub race, but redundantly encode it?" where I do calculations to show the effect of this: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

In short, if your data is biased against a group, but you include group membership either directly or via redundant encoding, your algorithm will fix the bias as best it can.

The entire purpose of machine learning is to discover hidden features and correlations in messy data, so I fail to see why this is considered surprising.

I generally consider the "right to explanation" to be a fairly transparent attempt by the EU to keep American tech companies out of Europe. The entire purpose of ML is that it can uncover true facts that humans can't. The right to explanation is just an attempt to hobble this power, probably because few Euro companies can do it.

In the case of being treated differently (requiring a different amount of downpayment as security), it's probably racist. More common and less controversial is the case when the signals are in different channels.

For instance, when dealing with immigrants, US banks often fail to see any signal at all because their credit reporting only covers US institutions, and they don't know how to verify employment or schooling abroad. So to start making loans to immigrants from any given country, they need to figure out what the signals are (job, schooling, ...) and how they correlate with risk.

> a Trump voter

Is that really necessary?

Some of us are treating the political system like a blackbox, I'm just sending a different corrupt payload at it to see what the output is.

Perhaps not. I've altered it.
Well, estimating higher risk due to a lack of information is not a glitch; it's the rationally correct estimate. Say you're a complete stranger and want to hang out with me: that's pretty scary! However, if I know you and you're a jerk, you might piss me off during the night, but at least I know you're not a serial killer...
Maybe you'd need a multi-armed bandit algorithm [1] to allow for some exploration of the dataset?

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit
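For what that could look like, here is a minimal epsilon-greedy sketch, one of the simplest bandit strategies. Treating each applicant segment as an arm and a repaid loan as a reward of 1 is my framing for illustration; the repayment rates and epsilon are made up.

```python
import random

def epsilon_greedy(true_rates, rounds, epsilon, seed=0):
    """Lend to one segment per round; return (loans made, estimated rate) per segment.

    true_rates: hidden repayment probability of each segment (the "arms").
    epsilon:    fraction of rounds spent lending to a random segment,
                so every segment keeps generating data.
    """
    rng = random.Random(seed)
    counts = [0] * len(true_rates)
    estimates = [0.0] * len(true_rates)
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_rates))  # explore
        else:
            arm = max(range(len(true_rates)), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for this segment.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

true_rates = [0.90, 0.80]  # hypothetical established and new segments
counts, estimates = epsilon_greedy(true_rates, rounds=20_000, epsilon=0.1)
for segment, (n, est) in enumerate(zip(counts, estimates)):
    print(f"segment {segment}: {n} loans, estimated repayment rate {est:.3f}")
```

The exploration rounds are the "first loans with high losses" mentioned upthread: they cost money in the short run but keep the estimate for the less-favored segment from being frozen at whatever the first few loans happened to show.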

If this is the case, competition will weed it out.
That's... optimistic. In the long run, maybe, but someone has to actually do it.