by closed 3574 days ago
Even in models where race doesn't directly cause an outcome, a model's judgements may be biased against a race.

For example, suppose that (1) people can be green or blue, (2) green people tend to live in Idaho, (3) living in Idaho is associated with people not paying back loans.

Consider a linear model whose only non-zero (positive) coefficients lie on the paths p(green) -> p(Idaho) -> p(fail_to_repay) and p(credit_score) -> p(fail_to_repay). It will create trouble even though color does not directly affect repayment: if you fit a multiple regression fail_to_repay ~ B0 + B1*Idaho + B2*credit_score, it will discriminate against green people by penalizing people from Idaho.
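A minimal sketch of that setup, using made-up counts (the 80%/20% Idaho split and the 30%/10% default rates are assumptions, and credit_score is dropped for brevity). With a single 0/1 regressor, least squares reduces to group means, so no fitting library is needed:

```python
from fractions import Fraction

# Hypothetical population: (color, lives_in_idaho, defaulted) -> people.
# Within each state both colors default at the same rate
# (30% in Idaho, 10% elsewhere), so color has no direct effect.
POPULATION = {
    ("green", 1, 1): 240, ("green", 1, 0): 560,
    ("green", 0, 1): 20,  ("green", 0, 0): 180,
    ("blue", 1, 1): 60,   ("blue", 1, 0): 140,
    ("blue", 0, 1): 80,   ("blue", 0, 0): 720,
}


def default_rate(cells):
    """Share of defaulters among the given (key, count) cells."""
    cells = list(cells)
    total = sum(n for _, n in cells)
    defaults = sum(n for (_, _, d), n in cells if d == 1)
    return Fraction(defaults, total)


def fit_idaho_only(population):
    """OLS of default on one 0/1 Idaho indicator.

    B0 is the default rate outside Idaho; B0 + B1 is the rate inside Idaho.
    """
    b0 = default_rate((k, n) for k, n in population.items() if k[1] == 0)
    inside = default_rate((k, n) for k, n in population.items() if k[1] == 1)
    return b0, inside - b0


def mean_predicted_risk(population, color, b0, b1):
    """Average of B0 + B1*Idaho over everyone of the given color."""
    cells = [(k, n) for k, n in population.items() if k[0] == color]
    total = sum(n for _, n in cells)
    return sum((b0 + b1 * k[1]) * n for k, n in cells) / total


B0, B1 = fit_idaho_only(POPULATION)
GREEN_RISK = mean_predicted_risk(POPULATION, "green", B0, B1)
BLUE_RISK = mean_predicted_risk(POPULATION, "blue", B0, B1)
print(f"B0={float(B0):.2f} B1={float(B1):.2f}")
print(f"mean predicted risk: green={float(GREEN_RISK):.2f} blue={float(BLUE_RISK):.2f}")
```

With these numbers the model never sees color, yet greens are scored at 0.26 on average and blues at 0.14, purely through the Idaho coefficient.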

AFAIK, one of the points of the paper linked in the parent comment is that blindly using indicators like IP address may indirectly lead to discrimination against a racial group in this way, e.g. p(racial_group) -> p(a_specific_IP_address).

Maybe more relevant to your example, though: in the "ground-truth" scenario I presented, assuming whites and blacks share the same model could still cause that model to discriminate (when it shouldn't, because the coefficient for the path p(green) -> p(fail_to_repay) is 0).

This specific issue is hairy, and exists for traditional approaches also.


If I understand your model right, you are saying that Idahoans don't repay loans and your model accurately reflects this. This isn't a bias at all. The model is issuing fewer loans to green people not because they are green but because they live in Idaho and are unlikely to pay back said loans.

This is a case like what is described in the article - when a perfect predictor (another word for this is "reality" or "hindsight") will still exhibit disparate impact.

It is a bias if you calculate the cost to people taking out loans, based on color. Green people will pay a higher cost, even though in the ground-truth model their race is not directly related to loan repayment.

For example, if only blue people in Idaho fail to repay loans, green people will still absorb a greater cost in the multiple regression case above (in the sense that they are more likely to be penalized for being Idahoans).
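To put rough numbers on that case, here is a sketch with invented counts (the population sizes and the 50% default rate for blue Idahoans are assumptions). A color-blind model assigns every Idahoan the pooled Idaho default rate, and we compare that with each color's true rate:

```python
from fractions import Fraction

# Hypothetical cells: (color, lives_in_idaho) -> (people, defaults).
# Only blue Idahoans ever default in this made-up ground truth.
CELLS = {
    ("green", 1): (800, 0),
    ("green", 0): (200, 0),
    ("blue", 1): (200, 100),
    ("blue", 0): (800, 0),
}


def rate_by_state(cells, idaho):
    """Default rate a color-blind model assigns to everyone in one state group."""
    people = sum(n for (_, i), (n, _) in cells.items() if i == idaho)
    defaults = sum(d for (_, i), (_, d) in cells.items() if i == idaho)
    return Fraction(defaults, people)


def summarize(cells, color):
    """Return (true default rate, mean predicted rate) for one color."""
    people = sum(n for (c, _), (n, _) in cells.items() if c == color)
    defaults = sum(d for (c, _), (_, d) in cells.items() if c == color)
    predicted = sum(
        rate_by_state(cells, i) * n
        for (c, i), (n, _) in cells.items()
        if c == color
    )
    return Fraction(defaults, people), predicted / people


GREEN_TRUE, GREEN_PRED = summarize(CELLS, "green")
BLUE_TRUE, BLUE_PRED = summarize(CELLS, "blue")
print(f"green: true={float(GREEN_TRUE):.2f} predicted={float(GREEN_PRED):.2f}")
print(f"blue:  true={float(BLUE_TRUE):.2f} predicted={float(BLUE_PRED):.2f}")
```

In this toy population greens never default but are priced as if 8% of them do, while blues default 10% of the time and are priced at 2%.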

Yes, if it's actually (blue & Idaho) ~> default, and your model ignores blue, then the greens will pay a higher cost. If color is redundantly encoded then your model can partially fix this and penalize the blues in Idaho.

Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

> Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize.

I should have been clearer that I was responding to this part of your comment. Even if blacks and whites aren't fundamentally different (in the sense that your race does not directly cause an outcome of interest), you can produce biases that are essentially a misattribution of the relationship between race and that outcome. Worse, if there _is_ a relationship, a model can estimate it with the wrong sign (Simpson's paradox).
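A small illustration of the sign reversal, with counts invented for the purpose (none of these numbers come from the paper or the thread). Within each state greens default less often than blues, but pooling the states flips the comparison because most greens live in the high-default state:

```python
from fractions import Fraction

# Hypothetical counts: (color, state) -> (people, defaults).
COUNTS = {
    ("green", "Idaho"): (800, 240),
    ("blue", "Idaho"): (200, 70),
    ("green", "elsewhere"): (200, 10),
    ("blue", "elsewhere"): (800, 80),
}


def rate(color, state=None):
    """Default rate for a color, within one state or pooled when state is None."""
    rows = [v for (c, s), v in COUNTS.items() if c == color and state in (None, s)]
    people = sum(n for n, _ in rows)
    defaults = sum(d for _, d in rows)
    return Fraction(defaults, people)


# Green-minus-blue gap in default rate, within each state and pooled.
WITHIN = {s: rate("green", s) - rate("blue", s) for s in ("Idaho", "elsewhere")}
POOLED = rate("green") - rate("blue")

for state, gap in WITHIN.items():
    print(f"{state}: green - blue = {float(gap):+.2f}")
print(f"pooled: green - blue = {float(POOLED):+.2f}")
```

Here the within-state gap is -0.05 in both states, while the pooled gap is +0.10, so a model that omits state would estimate the color effect with the wrong sign.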

> Do you consider this situation unjust? If so, you might be unhappy to learn that the entire goal of the field of algorithmic fairness is to do something along these lines.

I don't think the creation of tools to accommodate this specific purpose is bad, per se. Whether or not they are the appropriate tool to use is a different question.