| HN Mirror


by yummyfajitas 3615 days ago

Did you read what I wrote? Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

If repayment probability for blacks and whites alike is A x downpayment_fraction + B x credit_score, you can use training data from whites and the model will accurately predict black repayment probability. It only fails if you actually need A' and B' for blacks.

As an example, maybe for whites A = 1.0 and for blacks A' = 0.75. In that case the optimal decision is to demand higher lending standards for blacks - a black person with a 40% downpayment would be treated the same as a white person with a 30% downpayment. Is this your belief?
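The arithmetic behind that equivalence can be checked with a minimal sketch. The values of A and A' are the hypothetical ones from the comment, not empirical estimates; `B`, `CREDIT_SCORE`, and the function name are assumptions added for illustration.

```python
import math

def repayment_score(downpayment_fraction, credit_score, a, b):
    """Linear score from the comment: a * downpayment_fraction + b * credit_score."""
    return a * downpayment_fraction + b * credit_score

B = 0.5              # assumed; the comment gives no value for B
CREDIT_SCORE = 0.7   # assumed normalized score, identical for both borrowers

# A = 1.0 with a 30% downpayment versus A' = 0.75 with a 40% downpayment.
score_a = repayment_score(0.30, CREDIT_SCORE, a=1.0, b=B)
score_a_prime = repayment_score(0.40, CREDIT_SCORE, a=0.75, b=B)

# 1.0 * 0.30 and 0.75 * 0.40 are both 0.30, so the two scores agree
# (compared with isclose because of floating-point rounding).
print(math.isclose(score_a, score_a_prime))
```

The credit-score term cancels as long as both borrowers share the same score, so the comparison does not depend on the assumed `B`.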

3 comments

closed 3614 days ago

Even in models where race doesn't directly cause an outcome, a model's judgements may be biased against a race.

For example, suppose that (1) people can be green or blue, (2) green people tend to live in Idaho, (3) living in Idaho is associated with people not paying back loans.

A linear model where there are only non-zero, positive coefficients for the path p(green) -> p(Idaho) -> p(fail_to_repay), and p(credit_score) -> p(fail_to_repay) will create trouble, even though color does not directly affect repayment. If you use a multiple regression with fail_to_repay ~ B0 + B1 x Idaho + B2 x credit_score, it will discriminate against green people by penalizing people from Idaho.
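A rough sketch of that path, with made-up probabilities: default risk depends only on location, yet the average risk a location-only model assigns differs by color because color and location are correlated. Every number and name here is an assumption for illustration.

```python
# All probabilities below are made up for illustration.
P_IDAHO_GIVEN_COLOR = {"green": 0.8, "blue": 0.2}
# Default risk depends only on where someone lives, never on color.
P_DEFAULT_GIVEN_IDAHO = {True: 0.30, False: 0.10}

def mean_predicted_default(color):
    """Average default risk a location-only model assigns to a color group."""
    p_idaho = P_IDAHO_GIVEN_COLOR[color]
    return (p_idaho * P_DEFAULT_GIVEN_IDAHO[True]
            + (1 - p_idaho) * P_DEFAULT_GIVEN_IDAHO[False])

for color in ("green", "blue"):
    print(color, round(mean_predicted_default(color), 2))
```

With these numbers the model assigns green people an average risk of about 0.26 and blue people about 0.14, even though color never enters the model.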

AFAIK, one of the points of the paper linked in the parent comment is that blindly using indicators like IP address may indirectly lead to discrimination against a racial group in this way, e.g. p(racial_group) -> p(a_specific_IP_address).

Maybe more relevant to your example, though, is that assuming whites and blacks have the same model in the "ground-truth" scenario I presented could cause a model to be discriminative (when it shouldn't be, because the coefficient for the path from p(green) -> p(fail_to_repay) is 0).

This specific issue is hairy, and exists for traditional approaches also.

yummyfajitas 3614 days ago

If I understand your model right, you are saying that Idahoans don't repay loans and your model accurately reflects this. This isn't a bias at all. The model is issuing fewer loans to green people not because they are green but because they live in Idaho and are unlikely to pay back said loans.

This is a case like what is described in the article - when a perfect predictor (another word for this is "reality" or "hindsight") will still exhibit disparate impact.

closed 3614 days ago

It is a bias if you calculate the cost to people taking out loans, based on color. Green people will pay a higher cost, even though in the ground-truth model their race is not directly related to loan repayment.

For example, if only blue people in Idaho fail to repay loans, green people will still absorb a greater cost in the multiple regression case above (in the sense that they are more likely to be penalized for being Idahoans).
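To put numbers on this, here is a small sketch with hypothetical counts. It uses a softened version of the scenario (blue Idahoans default more often rather than exclusively) and computes the single Idaho rate a color-blind model would see.

```python
# Hypothetical counts and default rates for borrowers living in Idaho.
IDAHO = {
    "green": {"count": 80, "true_default_rate": 0.10},
    "blue": {"count": 20, "true_default_rate": 0.50},
}

def color_blind_idaho_rate(groups):
    """Default rate a model sees for Idaho when it cannot observe color."""
    total = sum(g["count"] for g in groups.values())
    defaults = sum(g["count"] * g["true_default_rate"] for g in groups.values())
    return defaults / total

pooled = color_blind_idaho_rate(IDAHO)
for color, g in IDAHO.items():
    # Positive gap: the group is scored as riskier than it really is.
    gap = pooled - g["true_default_rate"]
    print(f"{color}: true={g['true_default_rate']:.2f} pooled={pooled:.2f} gap={gap:+.2f}")
```

Under these assumptions green Idahoans are scored at the pooled rate of 0.18 rather than their own 0.10, while blue Idahoans are scored below their own rate.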

yummyfajitas 3614 days ago

Yes, if it's actually (blue & Idaho) ~> default, and your model ignores blue, then the greens will pay a higher cost. If color is redundantly encoded then your model can partially fix this and penalize the blues in Idaho.

Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

closed 3614 days ago

> Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

I should have been more clear that I was responding to this part of your comment. Even if blacks and whites aren't fundamentally different (in the sense that your race does not directly cause an outcome of interest), you can produce biases that are essentially a misattribution about the relationship between race and that outcome. Worse, if there *is* a relationship, you can reverse the direction a model estimates for the relationship (Simpson's paradox).
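For anyone unfamiliar with the reversal, a minimal sketch using illustrative (repaid, total) counts chosen only to produce it; the group and stratum names are assumptions, not data from any lender.

```python
from fractions import Fraction

# Illustrative (repaid, total) counts chosen to produce the reversal.
LOANS = {
    "green": {"high_score": (81, 87), "low_score": (192, 263)},
    "blue": {"high_score": (234, 270), "low_score": (55, 80)},
}

def rate(repaid, total):
    """Exact repayment rate for one stratum."""
    return Fraction(repaid, total)

def overall_rate(strata):
    """Exact repayment rate after pooling all strata of one group."""
    repaid = sum(r for r, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return Fraction(repaid, total)

for stratum in ("high_score", "low_score"):
    # Within each credit-score stratum, green repays at a higher rate.
    assert rate(*LOANS["green"][stratum]) > rate(*LOANS["blue"][stratum])

# Pooled over strata, the ordering flips.
assert overall_rate(LOANS["green"]) < overall_rate(LOANS["blue"])
print(float(overall_rate(LOANS["green"])), float(overall_rate(LOANS["blue"])))
```

A model that omits the stratifying variable would estimate the pooled direction, which is the opposite of the within-stratum direction.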

> Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

I don't think the creation of tools to accommodate this specific purpose is bad, per se. Whether or not they are the appropriate tool to use is a different question.

danso 3614 days ago

OK, I guess I'm supposed to agree with you if I beg the question that "available data on blacks specifically is completely irrelevant"...? I do think that the distribution of data specific to blacks is relevant.

yummyfajitas 3614 days ago

Ok, so now we have all acknowledged that we are "race realists" or "scientific racists" in this conversation. ( https://en.wikipedia.org/wiki/Scientific_racism )

Anyway, we've now accepted blacks and whites may behave differently. For example, let's suppose we have all the training data we need to accurately recognize that one race doesn't pay back their loans as much as others, all else held equal.

What should we do about it? Concretely, how many bad loans should we issue in the name of "fairness"? How large a subsidy must the responsible races pay to the deadbeat ones?

danso 3614 days ago

I don't know that either I or Dr. King Jr. have to subscribe to scientific racism just because we subscribe to the reality that folks of different racial backgrounds have a higher probability of being shortchanged historically. And thus, that any machine learning approach that doesn't factor this in will risk perpetuating such disadvantages, which kind of defeats the ostensible purpose for using machine learning to apply public policy in the first place.

yummyfajitas 3614 days ago

Historically isn't the issue. The issue is a simple factual question of whether, all else held equal, black people repay their loans at the same rate as whites in identical financial circumstances. The fact that in aggregate financial circumstances might be different isn't important to this question.

If they do, then you don't need to worry about algorithms discriminating. Insofar as they do it's merely a sampling error (i.e. shrinks like O(1/sqrt(N)), where N = Nwhite + Nblack) and they are just as likely to discriminate in favor as against.
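The O(1/sqrt(N)) scaling can be illustrated with the textbook standard error of an estimated rate. This is a simplification that assumes independent loans and a single pooled rate; `P_REPAY` and the function name are assumptions.

```python
import math

def standard_error(p, n):
    """Standard error of an estimated rate p from n independent loans."""
    return math.sqrt(p * (1 - p) / n)

P_REPAY = 0.9  # assumed true repayment rate, for illustration only

# Each 100x increase in N shrinks the standard error by a factor of 10.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(P_REPAY, n))
```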

If they don't, then you subscribe to scientific racism, or the belief that blacks and whites in identical circumstances behave fundamentally differently.

(I describe these different cases in explicit detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_... )

So do you believe race affects reality independent of other factors? And assuming you do subscribe to scientific racism, what should we do about it?

danso 3614 days ago

> The issue is a simple factual question of whether, all else held equal, black people repay their loans at the same rate as whites in identical financial circumstances

Oh if you put it that way, then I don't know. Because that's not the reality that's being dealt with, in which whites and blacks have identical circumstances. I think you're reading something into this that others aren't.

yummyfajitas 3614 days ago

> Because that's not the reality that's being dealt with, in which whites and blacks have identical circumstances.

Of course it is. There may be 5 blacks and 100 whites with a credit score of 830. But as long as blacks and whites with an 830 credit score behave the same, then data from whites will generalize to blacks and the problem tlb brought up doesn't apply. Redundant encoding is also irrelevant - this is useless information so an accuracy maximizer has no reason to pay any attention.

Insofar as blacks and whites with an 830 credit score behave differently, then algorithms might treat them differently. That's the "race realism" hypothesis.

closed 3614 days ago

FWIW, I present a case where race does not directly cause increased failure to repay, but common approaches to modeling could discriminate against race.

These issues have been discussed in detail in statistical considerations of Simpson's paradox. One need not accept that racial differences directly affect an outcome of interest, in order to be concerned about a model being biased against race!

skybrian 3614 days ago

I don't know what you mean by "fundamentally different" but there are definitely going to be demographic differences that the algorithm could use to predict race with good probability from hidden variables. (Where they live, for example.) History has an influence that's hard to remove from the dataset.

I'd guess that another reason this problem is hard is that it's about defining the goal correctly. It's not just maximizing repayment. There is some fairness goal that isn't well-defined.

yummyfajitas 3614 days ago

By "fundamentally different", I mean that the most accurate model will be something like this:

    repayment_probability = 1 x downpayment_frac + 0.5 x credit_score + A x isBlack

for some A != 0. I.e., if A = -0.2, then a black borrower with a 60% downpayment is as likely to pay back a loan as a white borrower with a 40% downpayment.
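A quick check of that arithmetic, as a sketch only. The coefficients and A = -0.2 are the hypothetical values from the formula above, not empirical estimates; `CREDIT_SCORE` and the function name are assumptions, and both borrowers are given the same credit score.

```python
import math

def repayment_probability(downpayment_frac, credit_score, is_black, a):
    """Example model: 1 x downpayment_frac + 0.5 x credit_score + A x isBlack."""
    return 1.0 * downpayment_frac + 0.5 * credit_score + a * (1 if is_black else 0)

CREDIT_SCORE = 0.5  # assumed normalized score shared by both borrowers

# With A = -0.2, a 60% downpayment and a 40% downpayment give the same score.
with_term = repayment_probability(0.60, CREDIT_SCORE, is_black=True, a=-0.2)
baseline = repayment_probability(0.40, CREDIT_SCORE, is_black=False, a=-0.2)
print(math.isclose(with_term, baseline))

# With A = 0 the group term vanishes and identical inputs give identical scores.
same_a = repayment_probability(0.40, CREDIT_SCORE, is_black=True, a=0.0)
same_b = repayment_probability(0.40, CREDIT_SCORE, is_black=False, a=0.0)
print(same_a == same_b)
```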

If A = 0, then the bias described by tlb and danso won't occur.

What you describe with hidden variables is called "redundant encoding", and it's just a way of recovering the `A x isBlack` term if you remove `isBlack` from your input set. But if blacks and whites repay their loans at the same rate (holding all else equal), redundant encoding won't happen - it doesn't actually improve accuracy.

I describe this in more detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

I agree with you that the core issue is an unspecified true goal. Folks are unwilling to publicly and explicitly state how many bad loans should be issued for fairness or how many unqualified students should be allowed into college for diversity.

Or for an example closer to home, how much should we lower the bar in order to hire more non-Asian minorities in tech? Daring to ask that question gets you some pretty hostile responses.

skybrian 3614 days ago

Repayment rates are not just individual - they also depend on the financial strength of friends and family who can help you out if you get in trouble. So, I think we have to assume that there are performance-relevant differences that an algorithm will detect.

Also, unless the dataset has information about families, this isn't based on your actual family. It's based on the average benefit people like you get from their family.