by skybrian 3568 days ago
I don't know what you mean by "fundamentally different" but there are definitely going to be demographic differences that the algorithm could use to predict race with good probability from hidden variables. (Where they live, for example.) History has an influence that's hard to remove from the dataset.

I'd guess that another reason this problem is hard is that it's about defining the goal correctly. It's not just maximizing repayment. There is some fairness goal that isn't well-defined.

1 comments

By "fundamentally different", I mean that the most accurate model will be something like this:

    repayment_probability = 1 x downpayment_frac + 0.5 x credit_score + A x isBlack
for some A != 0. For example, if A = -0.2, then a black borrower with a 60% downpayment is as likely to pay back a loan as a white borrower with a 40% downpayment and the same credit score.
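To make the arithmetic concrete, here is a minimal sketch of that toy formula in Python. The function name `repayment_score` and the credit score value of 0.7 are made up for illustration; only the coefficients come from the formula above.

```python
import math

def repayment_score(downpayment_frac, credit_score, is_black, A):
    # Toy linear model from the comment above; not a real underwriting model.
    return 1 * downpayment_frac + 0.5 * credit_score + A * is_black

A = -0.2
credit = 0.7  # hypothetical value, held equal for both borrowers

black_60 = repayment_score(0.60, credit, is_black=1, A=A)
white_40 = repayment_score(0.40, credit, is_black=0, A=A)

# With A = -0.2, the 20-point downpayment gap is exactly offset by the A term.
print(black_60, white_40, math.isclose(black_60, white_40))

# With A = 0, the same two borrowers differ only by their downpayments.
gap_when_A_is_zero = (repayment_score(0.60, credit, 1, 0.0)
                      - repayment_score(0.40, credit, 0, 0.0))
print(gap_when_A_is_zero)
```

The two scores match only because credit score is held equal; change either borrower's credit score and the equivalence goes away.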

If A = 0, then the bias described by tlb and danso won't occur.

What you describe with hidden variables is called "redundant encoding", and it's just a way of recovering the `A x isBlack` term if you remove `isBlack` from your input set. But if blacks and whites repay their loans at the same rate (holding all else equal), redundant encoding won't happen, because the proxy variables don't actually improve accuracy.
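Here is a rough sketch of redundant encoding on synthetic data, under assumptions I'm making up for the example: a binary `proxy` feature (say, a neighborhood flag) that matches `isBlack` for 80% of borrowers, a balanced grid of downpayments and credit scores, and no noise. The model is fit by ordinary least squares without ever seeing `isBlack`.

```python
def solve(a, b):
    # Solve the linear system a x = b by Gauss-Jordan elimination
    # with partial pivoting.
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, ys):
    # Ordinary least squares via the normal equations (X^T X) w = X^T y.
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def proxy_coefficient(A):
    # Build a synthetic dataset where the true score uses isBlack,
    # then fit a model that only sees [1, downpayment, credit, proxy].
    rows, ys = [], []
    for d in [i / 10 for i in range(1, 10)]:      # downpayment fractions
        for c in (0.5, 0.7):                      # hypothetical credit scores
            for proxy in (0, 1):
                for k in range(10):
                    # proxy agrees with isBlack in 8 of every 10 rows
                    is_black = proxy if k < 8 else 1 - proxy
                    rows.append([1.0, d, c, float(proxy)])
                    ys.append(1 * d + 0.5 * c + A * is_black)
    return ols(rows, ys)[3]  # fitted weight on the proxy feature

print(proxy_coefficient(-0.2))  # nonzero: the proxy stands in for isBlack
print(proxy_coefficient(0.0))   # roughly zero: nothing for the proxy to recover
```

In this setup the fitted proxy weight works out to about 0.6 x A (0.6 is the gap between 80% and 20%), so it is about -0.12 when A = -0.2 and vanishes when A = 0. Real data would add noise and correlations between the proxy and the other features, so treat this as an illustration of the mechanism only.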

I describe this in more detail here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

I agree with you that the core issue is an unspecified true goal. Folks are unwilling to publicly and explicitly state how many bad loans should be issued for fairness or how many unqualified students should be allowed into college for diversity.

Or, for an example closer to home: how much should we lower the bar in order to hire more non-Asian minorities in tech? Daring to ask that question gets you some pretty hostile responses.

Repayment rates are not just individual; they also depend on the financial strength of friends and family who can help you out if you get in trouble. So I think we have to assume that there are performance-relevant differences that an algorithm will detect.

Also, unless the dataset has information about families, this isn't based on your actual family. It's based on the average benefit people like you get from their family.