| HN Mirror


by yummyfajitas 3549 days ago

One approach is to directly model the corruption process. Being the model-based-Bayesian guy I am, this is something I like to do.

But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process. In the example in my linked blog post, test scores might be biased against blacks. But race is also redundantly encoded, so the algorithm has enough information to fix the bias completely by accident.

Fundamentally what I'm saying here is that bias is a statistics problem and has a statistics solution. Insofar as your complaint is algorithms finding the wrong answer, the solution is better stats.

And nothing whatsoever that I've said here would be remotely controversial if the topic were remote sensing.


srean 3549 days ago

> But if your model is sufficiently expressive you don't need to explicitly build or model the corruption process

This is the claim that I am having trouble with.

Say I have two random variables X, Y with some joint distribution. If a corruption process can mess with the samples drawn from it, I cannot see how the model could possibly recover either the joint or the conditional.

Are you saying that the corruption is benign, like missing at random or missing completely at random? Then it's much more believable.


yummyfajitas 3549 days ago

So we both agree that if the bias is linear, and your model is linear, you capture it. Similarly if the model involves interaction (score x is_black), and you include linear interaction terms, you'll also capture it.
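A minimal sketch of the linear case, under assumptions that are not in the thread: the names (`ability`, `group`, `shift`, `shrink`) and all numbers are hypothetical, and the bias is taken to be an additive shift plus a multiplicative shrink applied to one group's score. With an interaction term in the design matrix, ordinary least squares can undo that bias without being told it exists. The algebra implies coefficients near 0, 1, `shift / (1 - shrink)` and `1 / (1 - shrink) - 1`.

```python
import numpy as np

def fit_debiasing_regression(n=20000, shift=5.0, shrink=0.2, seed=0):
    """Fit outcome ~ 1 + score + group + score*group on synthetic data.

    Hypothetical setup: the observed score understates ability for
    group == 1 by a multiplicative shrink and an additive shift.
    """
    rng = np.random.default_rng(seed)
    ability = rng.normal(50.0, 10.0, size=n)          # latent quantity
    group = rng.integers(0, 2, size=n).astype(float)  # 0/1 indicator
    # Biased measurement: linear in ability, but different per group.
    score = (1.0 - shrink * group) * ability - shift * group
    # The outcome depends on true ability, not on the biased score.
    outcome = ability + rng.normal(0.0, 1.0, size=n)
    # Design matrix: intercept, score, group, score x group interaction.
    X = np.column_stack([np.ones(n), score, group, score * group])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coef

coef = fit_debiasing_regression()
print("intercept, score, group, score*group:", np.round(coef, 3))
```

The group and interaction coefficients are the learned correction. Drop those two columns and the same fit has no way to express a per-group adjustment.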

Now the question arises: what if things are more complex?

In real life they always are; both your biasing factor and the rest of the model. So we've cooked up all sorts of fun models like SVMs, random forests and neural networks to analyze such complicated models and find hidden features and relations that we didn't think of. Bias is one such feature.

If I built an algorithm that learned to display different ads to mobile and desktop people (i.e., treat mobile "time on site" differently from desktop "time on site"), would you be surprised by this?


srean 3549 days ago

That makes it clearer. I got thrown off by the claim that a standard algorithm will be able to de-bias if no de-biasing machinery has been built into it. BTW the machinery may be implicit in the choice of the model.

Simple toy example: say Y is a threshold function of X + high variance noise. I draw samples from this and scale down all y_i's that exceed the (unknown) threshold. In other words my corruption process is dependent on X. We can make it depend on Y too. These would require explicit modeling. Just throwing a uniformly rich class of P(X,Y) at the data won't by itself fix this. We have to carve that space of P(X,Y) using knowledge of the possible corruption process to get a good model of the behavior before the corruption is applied.
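A rough simulation of that toy example, under one reading of it (an assumption, since the description is terse): Y is a unit step in X plus Gaussian noise, and the corruption multiplies y_i by a constant whenever x_i is past the threshold. The threshold, scale factor, noise level and the binned-mean estimator are all hypothetical choices. The binned mean stands in for a flexible model with no de-biasing machinery.

```python
import numpy as np

def binned_conditional_mean(x, y, edges):
    """Nonparametric estimate of E[Y | X] as the mean of y per bin of x."""
    idx = np.digitize(x, edges[1:-1])  # bin index 0 .. len(edges) - 2
    return np.array([y[idx == k].mean() for k in range(len(edges) - 1)])

def simulate(n=50000, threshold=0.5, scale=0.3, noise_sd=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    # Clean process: unit step at the threshold plus high-variance noise.
    y_clean = (x > threshold).astype(float) + rng.normal(0.0, noise_sd, size=n)
    # Corruption depends on X: shrink every sample past the threshold.
    y_corrupt = np.where(x > threshold, scale * y_clean, y_clean)
    edges = np.linspace(0.0, 1.0, 11)  # ten equal-width bins
    return (binned_conditional_mean(x, y_clean, edges),
            binned_conditional_mean(x, y_corrupt, edges))

clean_fit, corrupt_fit = simulate()
print("clean   E[Y|X] per bin:", np.round(clean_fit, 2))
print("corrupt E[Y|X] per bin:", np.round(corrupt_fit, 2))
```

The estimator fitted to the corrupted samples tracks the corrupted conditional mean, with a step height of roughly `scale` rather than 1. Nothing in the corrupted data alone distinguishes a shrunken step from a genuinely smaller one, so recovering the original needs some assumption about the corruption.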

BTW we have gone way off on a tangent, but that was a good conversation.
