by danso 3568 days ago
Yes, blacks are fundamentally different from whites in terms of the available data to train algorithms on:

http://www.nytimes.com/2015/10/31/nyregion/hudson-city-bank-...

> The government’s analysis of the bank’s lending data shows that Hudson’s competitors generated nearly three times as many home loan applications from predominantly black and Hispanic communities as Hudson did in a region that includes New York City, Westchester County and North Jersey, and more than 10 times as many home loan applications from black and Hispanic communities in the market that includes Camden, N.J.

That's, of course, just recent history. Redlining from the 1960s onward would be enough to adversely affect the housing history data of minority groups even today. Treating everyone equally in the eyes of the algorithm is certainly an easy route to go, but as the non-algorithm expert MLK Jr. pointed out:

> Whenever the issue of compensatory treatment for the Negro is raised, some of our friends recoil in horror. The Negro should be granted equality, they agree; but he should ask nothing more. On the surface, this appears reasonable, but it is not realistic.


Did you read what I wrote? Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

If repayment probability for blacks and whites alike is A x downpayment_fraction + B x credit_score, you can use training data from whites and the model will accurately predict black repayment probability. It only fails if you actually need A' and B' for blacks.

As an example, maybe for whites A = 1.0 and for blacks A' = 0.75. In that case the optimal decision is to demand higher lending standards for blacks - a black person with a 40% downpayment would be treated the same as a white person with a 30% downpayment. Is this your belief?
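The arithmetic behind that 40% versus 30% comparison can be checked with a minimal sketch. It assumes the linear form given above with the credit-score term held equal; the function name and the values of B and the credit score are hypothetical, chosen only for illustration.

```python
def repayment_score(a, b, downpayment_fraction, credit_score):
    """Linear score from the comment: A x downpayment_fraction + B x credit_score."""
    return a * downpayment_fraction + b * credit_score

B = 0.5             # hypothetical; the comment gives no value for B
CREDIT_SCORE = 0.7  # hypothetical; held equal for both borrowers

# A = 1.0 with a 30% downpayment versus A' = 0.75 with a 40% downpayment.
score_a = repayment_score(1.0, B, 0.30, CREDIT_SCORE)
score_a_prime = repayment_score(0.75, B, 0.40, CREDIT_SCORE)

# The two scores agree up to floating-point rounding.
print(score_a, score_a_prime)
```

Since 0.75 x 0.40 = 1.0 x 0.30, the two borrowers get the same score whenever their credit scores match.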

Even in models where race doesn't directly cause an outcome, a model's judgements may be biased against a race.

For example, suppose that (1) people can be green or blue, (2) green people tend to live in Idaho, (3) living in Idaho is associated with people not paying back loans.

A linear model where there are only non-zero, positive coefficients for the path p(green) -> p(Idaho) -> p(fail_to_repay), and p(credit_score) -> p(fail_to_repay) will create trouble, even though color does not directly affect repayment. If you use a multiple regression with fail_to_repay ~ B0 + B1 x Idaho + B2 x credit_score, it will discriminate against green people, by penalizing people from Idaho.
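A small sketch of the green/blue setup, with made-up probabilities (every number below is an assumption, not data): default risk depends only on living in Idaho, yet the average risk a model assigns still differs by color because greens are more likely to live there.

```python
# Hypothetical probabilities, chosen only to illustrate the causal path
# p(green) -> p(Idaho) -> p(fail_to_repay).
P_IDAHO_GIVEN_COLOR = {"green": 0.8, "blue": 0.2}
P_DEFAULT_GIVEN_IDAHO = {True: 0.30, False: 0.10}  # color never enters here

def mean_predicted_risk(color):
    """Average default risk a model using only the Idaho flag assigns to a color."""
    p_idaho = P_IDAHO_GIVEN_COLOR[color]
    return (p_idaho * P_DEFAULT_GIVEN_IDAHO[True]
            + (1 - p_idaho) * P_DEFAULT_GIVEN_IDAHO[False])

for color in ("green", "blue"):
    print(color, round(mean_predicted_risk(color), 3))
```

Under these assumed numbers the model never sees color, but greens carry a higher average predicted risk than blues.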

AFAIK, one of the points of the paper linked in the parent comment is that blindly using indicators like IP address may indirectly lead to discrimination against a racial group in this way, e.g. p(racial_group) -> p(a_specific_IP_address).

Maybe more relevant to your example, though, is that assuming whites and blacks have the same model in the "ground-truth" scenario I presented could cause a model to be discriminative (when it shouldn't be, because the coefficient for the path from p(green) -> p(fail_to_repay) is 0).

This specific issue is hairy, and exists for traditional approaches also.

If I understand your model right, you are saying that Idahoans don't repay loans and your model accurately reflects this. This isn't a bias at all. The model is issuing fewer loans to green people not because they are green but because they live in Idaho and are unlikely to pay back said loans.

This is a case like what is described in the article - when a perfect predictor (another word for this is "reality" or "hindsight") will still exhibit disparate impact.

It is a bias if you calculate the cost to people taking out loans, based on color. Green people will pay a higher cost, even though in the ground-truth model their race is not directly related to loan repayment.

For example, if only blue people in Idaho fail to repay loans, green people will still absorb a greater cost in the multiple regression case above (in the sense that they are more likely to be penalized for being Idahoans).

Yes, if it's actually (blue & Idaho) ~> default, and your model ignores blue, then the greens will pay a higher cost. If color is redundantly encoded then your model can partially fix this and penalize the blues in Idaho.
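To make the cost shift concrete, here is a rough sketch under assumed numbers (the population shares and the 0.5 default rate are hypothetical). Only blue Idahoans default, and a color-blind model assigns every Idahoan the pooled Idaho risk.

```python
# Hypothetical setup: only (blue & Idaho) borrowers ever default.
P_COLOR = {"green": 0.5, "blue": 0.5}
P_IDAHO_GIVEN_COLOR = {"green": 0.8, "blue": 0.2}
TRUE_RISK = {
    ("blue", True): 0.5, ("blue", False): 0.0,
    ("green", True): 0.0, ("green", False): 0.0,
}

def p_location(color, idaho):
    p = P_IDAHO_GIVEN_COLOR[color]
    return p if idaho else 1 - p

def pooled_risk(idaho):
    """Risk a color-blind model assigns to everyone with this Idaho status."""
    num = den = 0.0
    for color, p_color in P_COLOR.items():
        weight = p_color * p_location(color, idaho)
        num += weight * TRUE_RISK[(color, idaho)]
        den += weight
    return num / den

def mean_overcharge(color):
    """Average gap between assigned risk and true risk for one color."""
    return sum(
        p_location(color, idaho) * (pooled_risk(idaho) - TRUE_RISK[(color, idaho)])
        for idaho in (True, False)
    )

print(pooled_risk(True), mean_overcharge("green"), mean_overcharge("blue"))
```

With these numbers greens are overcharged on average and blues are undercharged by the same amount, which is the transfer described above.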

Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

> Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

I should have been clearer that I was responding to this part of your comment: even if blacks and whites aren't fundamentally different (in the sense that your race does not directly cause an outcome of interest), you can produce biases that are essentially a misattribution about the relationship between race and that outcome. Worse, if there is a relationship, you can reverse the direction a model estimates for the relationship (Simpson's paradox).
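For readers unfamiliar with Simpson's paradox, a toy illustration using the green/blue framing from earlier in the thread. The counts are invented for the example: greens repay at a higher rate within each region, yet at a lower rate in aggregate, because most greens live in the low-repayment region.

```python
# Hypothetical counts: (loans repaid, loans issued) per color and region.
COUNTS = {
    ("green", "Idaho"): (700, 1000),
    ("blue", "Idaho"): (60, 100),
    ("green", "elsewhere"): (95, 100),
    ("blue", "elsewhere"): (900, 1000),
}

def repayment_rate(color, region=None):
    """Repayment rate for a color, within one region or pooled over all regions."""
    repaid = issued = 0
    for (c, r), (n_repaid, n_issued) in COUNTS.items():
        if c == color and (region is None or r == region):
            repaid += n_repaid
            issued += n_issued
    return repaid / issued

for region in ("Idaho", "elsewhere", None):
    label = region or "all regions"
    print(label, repayment_rate("green", region), repayment_rate("blue", region))
```

A model fit without the region variable would estimate the color effect with the wrong sign on this data.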

> Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

I don't think the creation of tools to accommodate this specific purpose is bad, per se. Whether or not they are the appropriate tool to use is a different question.

OK, I guess I'm supposed to agree with you if I beg the question that "available data on blacks specifically is completely irrelevant"...? I do think that the distribution of data specific to blacks is relevant.

Ok, so now we have all acknowledged that we are "race realists" or "scientific racists" in this conversation. (https://en.wikipedia.org/wiki/Scientific_racism)

Anyway, we've now accepted that blacks and whites may behave differently. For example, let's suppose we have all the training data we need to accurately recognize that one race doesn't pay back their loans as much as others, all else held equal.

What should we do about it? Concretely, how many bad loans should we issue in the name of "fairness"? How large a subsidy must the responsible races pay to the deadbeat ones?

I don't know that either I or Dr. King Jr. has to subscribe to scientific racism just because we subscribe to the reality that folks of different racial backgrounds have a higher probability of having been shortchanged historically, and thus that any machine learning approach that doesn't factor this in will risk perpetuating such disadvantages, which kind of defeats the ostensible purpose of using machine learning to apply public policy in the first place.

History isn't the issue. The issue is a simple factual question of whether, all else held equal, black people repay their loans at the same rate as whites in identical financial circumstances. The fact that in aggregate financial circumstances might be different isn't important to this question.

If they do, then you don't need to worry about algorithms discriminating. Insofar as they do, it's merely a sampling error (i.e., it shrinks like O(1/sqrt(N)), where N = N_white + N_black), and they are just as likely to discriminate in favor as against.
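A rough sketch of the O(1/sqrt(N)) claim, using the textbook standard error for the difference between two observed proportions. It assumes both groups share the same true repayment rate and, for simplicity, that the groups are the same size; the rate of 0.9 and the function name are hypothetical.

```python
import math

def gap_standard_error(p, n_a, n_b):
    """Standard error of the gap between two groups' observed repayment
    rates when both groups share the same true rate p."""
    return math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))

P_TRUE = 0.9  # hypothetical shared repayment rate

for n_each in (100, 400, 1600, 6400):
    total = 2 * n_each
    se = gap_standard_error(P_TRUE, n_each, n_each)
    # se * sqrt(N) stays constant, i.e. the error shrinks like 1/sqrt(N).
    print(f"N = {total:5d}  gap SE = {se:.4f}  SE * sqrt(N) = {se * math.sqrt(total):.3f}")
```

Quadrupling the sample halves the spurious gap, and since the gap is centered on zero its sign is equally likely to favor either group.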

If they don't, then you subscribe to scientific racism, or the belief that blacks and whites in identical circumstances behave fundamentally differently.

(I describe these different cases in explicit detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_... )

So do you believe race affects reality independent of other factors? And assuming you do subscribe to scientific racism, what should we do about it?

> The issue is a simple factual question of whether, all else held equal, black people repay their loans at the same rate as whites in identical financial circumstances

Oh, if you put it that way, then I don't know, because the reality being dealt with is not one in which whites and blacks have identical circumstances. I think you're reading something into this that others aren't.

FWIW, I present a case where race does not directly cause increased failure to repay, but common approaches to modeling could discriminate against race.

These issues have been discussed in detail in statistical considerations of Simpson's paradox. One need not accept that racial differences directly affect an outcome of interest, in order to be concerned about a model being biased against race!

I don't know what you mean by "fundamentally different" but there are definitely going to be demographic differences that the algorithm could use to predict race with good probability from hidden variables. (Where they live, for example.) History has an influence that's hard to remove from the dataset.

I'd guess that another reason this problem is hard is that it's about defining the goal correctly. It's not just maximizing repayment. There is some fairness goal that isn't well-defined.

By "fundamentally different", I mean that the most accurate model will be something like this:

    repayment_probability = 1 x downpayment_frac + 0.5 x credit_score + A x isBlack

for some A != 0. I.e., if A = -0.2, then a black borrower with a 60% downpayment is as likely to pay back a loan as a white borrower with a 40% downpayment.

If A = 0, then the bias described by tlb and danso won't occur.

What you describe with hidden variables is called "redundant encoding", and it's just a way of recovering the `A x isBlack` term if you remove `isBlack` from your input set. But if blacks and whites repay their loans at the same rate (holding all else equal), redundant encoding won't happen - it doesn't actually improve accuracy.
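A simulation sketch of redundant encoding, under assumptions that are mine rather than the comment's: the credit_score term is dropped for brevity, the group flag is hidden from the model, and a hypothetical proxy feature matches the group 90% of the time. The least-squares fit is a plain normal-equations solve written out with the standard library.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def proxy_coefficient(a_group, n=5000, seed=0):
    """Fit y ~ 1 + downpayment + proxy and return the proxy's coefficient."""
    rng = random.Random(seed)
    rows, ys = [], []
    for _ in range(n):
        downpayment = rng.random()
        group = rng.random() < 0.5                          # hidden from the model
        proxy = group if rng.random() < 0.9 else not group  # hypothetical 90% proxy
        y = 1.0 * downpayment + a_group * group + rng.gauss(0, 0.1)
        rows.append([1.0, downpayment, float(proxy)])
        ys.append(y)
    return ols(rows, ys)[2]

print("A = -0.2:", proxy_coefficient(-0.2))
print("A =  0.0:", proxy_coefficient(0.0))
```

With A = -0.2 the fit puts a clearly negative weight on the proxy, recovering part of the hidden term; with A = 0 the proxy's weight is only sampling noise around zero.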

I describe this in more detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

I agree with you that the core issue is an unspecified true goal. Folks are unwilling to publicly and explicitly state how many bad loans should be issued for fairness or how many unqualified students should be allowed into college for diversity.

Or, for an example closer to home, how much should we lower the bar in order to hire more non-Asian minorities in tech? Daring to ask that question gets you some pretty hostile responses.

Repayment rates are not just individual - they also depend on the financial strength of friends and family who can help you out if you get in trouble. So, I think we have to assume that there are performance-relevant differences that an algorithm will detect.

Also, unless the dataset has information about families, this isn't based on your actual family. It's based on the average benefit people like you get from their family.