| HN Mirror



	by mrow84 3549 days ago
	I don't really understand that claim. You are explicitly adding a bias that is linearly dependent on your race variable, and then allowing your regression to recover that bias by introducing noisy measurements of race (which you, as the modeller, knew was the thing causing the bias). As you say, that is unsurprising. That result does not, however, address my point, which is that if the structure of the bias is difficult to understand, or perhaps even just difficult to model, and if relevant measurements (with errors that are uncorrelated with your original errors) are unavailable, then bias correction is essentially impossible.


yummyfajitas 3549 days ago

The point is that the bias is linear, and my model is linear, so the model fixes things. The example is synthetic (so we could know what the right answer is and check if we recover it) so of course I put everything in.
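A minimal sketch of the kind of synthetic check described here. This is not the code from the linked article: the variable names, the bias size of -0.5, and the noise levels are all assumptions chosen for illustration, and `group` stands in for the race variable. It fits least squares once with the group column scrubbed and once with it included, then compares the mean prediction error between the two groups.

```python
import numpy as np


def group_gap(residuals, group):
    """Difference in mean residual between group 1 and group 0."""
    return residuals[group == 1].mean() - residuals[group == 0].mean()


def compare_scrubbed_and_full(n=20000, bias=-0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical binary group indicator (stand-in for the race variable).
    group = rng.integers(0, 2, size=n).astype(float)
    ability = rng.normal(size=n)                   # latent quantity we care about
    score = ability + bias * group                 # measurement with a linear bias
    outcome = ability + 0.1 * rng.normal(size=n)   # what we want to predict

    ones = np.ones(n)
    X_scrubbed = np.column_stack([score, ones])        # group column removed
    X_full = np.column_stack([score, group, ones])     # group column available

    coef_scrubbed = np.linalg.lstsq(X_scrubbed, outcome, rcond=None)[0]
    coef_full = np.linalg.lstsq(X_full, outcome, rcond=None)[0]

    gap_scrubbed = group_gap(outcome - X_scrubbed @ coef_scrubbed, group)
    gap_full = group_gap(outcome - X_full @ coef_full, group)
    return coef_full, gap_scrubbed, gap_full


coef_full, gap_scrubbed, gap_full = compare_scrubbed_and_full()
print("coefficients (score, group, intercept):", coef_full)
print("residual gap without group column:", gap_scrubbed)
print("residual gap with group column:", gap_full)
```

Under these assumptions the full model's coefficient on `group` comes out close to `-bias`, which is the correction being claimed, while the scrubbed model systematically under-predicts the group whose scores were pushed down.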

In the linked article, I explicitly reference a real world case where the same linear model was used to discover that grades and test scores are biased in favor of blacks: http://ftp.iza.org/dp8733.pdf

In more complicated situations, the bias would need to be amenable to detection by a neural network, an SVM, or a random forest. The entire purpose of models like these is to detect lots of hidden patterns.

Even if relevant measurements are unavailable, one can use redundant encoding to fix bias. Delip Rao explains redundant encoding here, for example, though he is more concerned that ML models might learn facts he wants to remain hidden: http://deliprao.com/archives/129
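A rough sketch of the redundant-encoding idea, again under assumed numbers rather than anything taken from the linked posts: the group column itself is withheld, but five hypothetical noisy proxies of it (`group` plus Gaussian noise) are left in the data. Least squares can use the proxies to recover part of the correction, so the residual gap between groups shrinks without vanishing.

```python
import numpy as np


def redundant_encoding_gap(n=20000, bias=-0.5, n_proxies=5,
                           proxy_noise=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical binary group indicator; never given to the model directly.
    group = rng.integers(0, 2, size=n).astype(float)
    ability = rng.normal(size=n)
    score = ability + bias * group                 # biased measurement
    outcome = ability + 0.1 * rng.normal(size=n)

    # Noisy proxies that redundantly encode the group (assumed form).
    proxies = group[:, None] + proxy_noise * rng.normal(size=(n, n_proxies))

    ones = np.ones(n)
    X_scrubbed = np.column_stack([score, ones])
    X_proxy = np.column_stack([score, proxies, ones])

    def gap(X):
        coef = np.linalg.lstsq(X, outcome, rcond=None)[0]
        resid = outcome - X @ coef
        return resid[group == 1].mean() - resid[group == 0].mean()

    return gap(X_scrubbed), gap(X_proxy)


gap_scrubbed, gap_proxy = redundant_encoding_gap()
print("residual gap, score only:", gap_scrubbed)
print("residual gap, score plus noisy proxies:", gap_proxy)
```

How much of the gap closes depends on how noisy the proxies are and how many there are; with noisier or fewer proxies the correction degrades toward the scrubbed case.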


mrow84 3549 days ago

To remain with the example in your blog post, your model fixed things because the implicit bias model was correct (linear dependence on race), and the data were available, either directly (via the race variable) in the "What if measurements are biased?" section, or indirectly (via the noisy redundantly-encoded race variables) in the "What if we scrub race, but redundantly encode it?" section.

In the first of those two sections you yourself note how bias correction is not possible without the relevant data: "If we scrubbed the data this result would be impossible. Running least squares on scrubbed data yields alpha = [ 0.29878373, 0.30869833] - we can't correct for bias because we don't know the variable being biased on."

I'm not disputing that bias correction is possible, only that it can be much harder than you seem to be implying, with statements like "Most algorithms can and will correct for biases in their inputs.", and "Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias."

I have some experience with bias correction in (ocean) weather forecasting, and in that domain there were problems both with the difficulty of modelling the bias structure, and with obtaining measurements reliable enough for bias correction.


yummyfajitas 3549 days ago

What I'm disputing is this:

> Machine learning does not have less bias than human researchers. It is simply magnified at scale.

This is fundamentally wrong. Given data on the biasing factor, most algorithms will try to use it and improve things. Sometimes that information is unavailable. On net, there is a reason why many algorithms will reduce bias, and no particular reason why they would increase it equally in the remaining cases.


dragonwriter 3549 days ago

> Given data on the biasing factor, most algorithms will try to use it and improve things.

Unless the bias is in what they are designed to optimize for (either because the goal is explicitly biased or because the operationalization of the goal into a concrete measure is, whether intentionally or not, biased), in which case they will obviously reinforce it.
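To make this case concrete, here is a small sketch under assumed numbers (not drawn from any source in the thread): the measurement is unbiased, but the target the model is trained to match carries a group-dependent offset of -0.5. Least squares, given the hypothetical `group` column, fits that offset faithfully, so its predictions reproduce the biased target.

```python
import numpy as np


def fit_on_biased_target(n=20000, label_bias=-0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Hypothetical binary group indicator.
    group = rng.integers(0, 2, size=n).astype(float)
    ability = rng.normal(size=n)
    score = ability + 0.1 * rng.normal(size=n)     # unbiased, slightly noisy measurement
    target = ability + label_bias * group          # the objective itself is biased

    X = np.column_stack([score, group, np.ones(n)])
    coef = np.linalg.lstsq(X, target, rcond=None)[0]

    predictions = X @ coef
    pred_gap = predictions[group == 1].mean() - predictions[group == 0].mean()
    return coef, pred_gap


coef, pred_gap = fit_on_biased_target()
print("coefficients (score, group, intercept):", coef)
print("gap in mean prediction between groups:", pred_gap)
```

Here the coefficient on `group` lands near `label_bias` even though ability is distributed identically in both groups: the model is doing exactly what it was asked to do, and what it was asked to do is biased.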
