by yummyfajitas
3568 days ago
Did you read what I wrote? Available data on blacks specifically is completely irrelevant if blacks and whites aren't fundamentally different. The white model will generalize. If repayment probability for blacks and whites alike is A * downpayment_fraction + B * credit_score, you can use training data from whites and the model will accurately predict black repayment probability. It only fails if you actually need different coefficients A' and B' for blacks. As an example, maybe for whites A = 1.0 and for blacks A' = 0.75. In that case the optimal decision is to demand higher lending standards for blacks: a black person with a 40% downpayment would be treated the same as a white person with a 30% downpayment, since 0.75 * 0.40 = 1.0 * 0.30. Is this your belief?
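To make the arithmetic concrete, here is a minimal sketch of that linear model. The values of B and the credit score are arbitrary assumptions chosen for illustration, and repayment_score is a hypothetical helper name; only A = 1.0 and A' = 0.75 come from the comment.

```python
import math


def repayment_score(a, b, downpayment_fraction, credit_score):
    """Linear score from the comment: A * downpayment_fraction + B * credit_score."""
    return a * downpayment_fraction + b * credit_score


A_WHITE = 1.0   # A from the comment
A_BLACK = 0.75  # A' from the comment (a hypothetical, not an empirical value)
B = 0.0005      # assumed: arbitrary credit-score coefficient, shared by both groups
CREDIT = 700    # assumed: same credit score for both applicants

# Case 1: coefficients are shared, so identical inputs give identical predictions
# and a model fitted on one group transfers to the other.
shared_1 = repayment_score(A_WHITE, B, 0.30, CREDIT)
shared_2 = repayment_score(A_WHITE, B, 0.30, CREDIT)

# Case 2: coefficients differ, so a 40% downpayment under A' scores the same
# as a 30% downpayment under A.
white_30 = repayment_score(A_WHITE, B, 0.30, CREDIT)
black_40 = repayment_score(A_BLACK, B, 0.40, CREDIT)

# Compare with a tolerance: 0.75 * 0.40 is not bit-identical to 0.30 in floating point.
print(shared_1 == shared_2)
print(math.isclose(white_30, black_40))
```

The second comparison uses math.isclose because the two products are equal mathematically but not bit-for-bit in binary floating point.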
|
For example, suppose that (1) people can be green or blue, (2) green people tend to live in Idaho, (3) living in Idaho is associated with people not paying back loans.
A linear model whose only non-zero (and positive) coefficients are on the paths p(green) -> p(Idaho) -> p(fail_to_repay) and p(credit_score) -> p(fail_to_repay) will cause trouble, even though color does not directly affect repayment. If you fit a multiple regression fail_to_repay ~ B0 + B1*Idaho + B2*credit_score, it will discriminate against green people by penalizing people from Idaho.
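A rough sketch of the proxy effect, using made-up numbers: the population counts and the coefficients B0 and B1 are assumptions for illustration, and the credit_score term is left out to keep it short. Color never enters the model, yet the average predicted default rate differs by color because green people mostly live in Idaho.

```python
# Hypothetical population: (color, lives_in_idaho, head_count).
# Assumed split: 80% of green people and 20% of blue people live in Idaho.
POPULATION = [
    ("green", True, 80),
    ("green", False, 20),
    ("blue", True, 20),
    ("blue", False, 80),
]

B0 = 0.05  # assumed baseline default probability
B1 = 0.10  # assumed penalty for living in Idaho


def predicted_default(lives_in_idaho):
    """Model output: B0 + B1 * Idaho. Color is not an input."""
    return B0 + B1 * (1 if lives_in_idaho else 0)


def mean_prediction(color):
    """Head-count-weighted average prediction for one color group."""
    rows = [(idaho, n) for c, idaho, n in POPULATION if c == color]
    total = sum(n for _, n in rows)
    return sum(predicted_default(idaho) * n for idaho, n in rows) / total


green_mean = mean_prediction("green")
blue_mean = mean_prediction("blue")

# Green applicants receive a higher average predicted default rate than blue
# applicants, purely through the Idaho term.
print(green_mean > blue_mean)
```

Within a given state, a green and a blue applicant get exactly the same prediction; the gap appears only in the group averages.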
AFAIK, one of the points of the paper linked in the parent comment is that blindly using indicators like IP address may indirectly lead to discrimination against a racial group in this way, e.g. p(racial_group) -> p(a_specific_IP_address).
Maybe more relevant to your example, though: in the "ground-truth" scenario I presented, assuming whites and blacks share the same model can still yield a discriminatory model. It shouldn't be discriminatory, because the coefficient on the direct path p(green) -> p(fail_to_repay) is 0.
This specific issue is hairy, and exists for traditional approaches also.