Hacker News
by mrow84 3549 days ago
To correct biased measurements (in a careful way) you need

1. Enough knowledge about the structure of the bias to be able to devise a model for it.

2. Some measurements from which to fit the model, with errors that are uncorrelated with the errors in your original data.

These things are not always easy to obtain, even in relatively mundane settings. It is also a distinctly non-automatic procedure - it requires someone to decide that a bias exists, to model it, obtain the relevant data, and fit the bias correction model, all before they can begin to obtain unbiased (or probably just less-biased) measurements.
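To make the two requirements concrete, here is a minimal sketch of that procedure, not taken from any real system. It assumes a hypothetical instrument with an affine bias (the values 1.5 and 1.1 are made up), an affine bias model, and reference measurements whose errors are drawn independently of the instrument's errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical biased instrument: reading = 1.5 + 1.1 * truth + noise.
truth = rng.uniform(0.0, 30.0, size=n)
biased = 1.5 + 1.1 * truth + rng.normal(0.0, 0.5, size=n)

# Requirement 2: reference measurements whose errors are independent
# of the instrument's errors (assumed to be available here).
reference = truth + rng.normal(0.0, 0.5, size=n)

# Requirement 1: an assumed affine bias model, fitted on a calibration subset.
calib = slice(0, 100)
slope, intercept = np.polyfit(biased[calib], reference[calib], deg=1)

# Apply the fitted correction to every reading.
corrected = intercept + slope * biased

raw_error = float(np.mean(np.abs(biased - truth)))
corrected_error = float(np.mean(np.abs(corrected - truth)))
print(f"mean abs error: raw={raw_error:.2f}, corrected={corrected_error:.2f}")
```

The correction only works here because the assumed model has the same form as the simulated bias and the reference data exist; both were decided by hand before any fitting happened.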

I'm not claiming that an algorithm magically fixes everything. I'm claiming that sometimes it does, which makes bias less likely to be present in the ML model.

You don't need a human data scientist to decide bias exists, model it and fix it at all. If you read the post I linked to, you can observe a synthetic example of linear regression (with redundant encodings) accidentally fixing bias.

So yes, if your model is expressive enough and you have sufficient data, it will automatically fix bias. Is it really shocking that an algorithm which is good at finding hidden patterns will find a hidden pattern?
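A rough sketch of the kind of synthetic check being described, written from scratch rather than copied from the linked post. The variable names, the generic binary `group` indicator, and the bias size of 1.0 are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
bias = 1.0  # hypothetical inflation added to group 1's measured scores

ability = rng.normal(0.0, 1.0, size=n)
group = rng.integers(0, 2, size=n).astype(float)

# The measurement is biased: group 1's scores are shifted upward.
score = ability + bias * group
# The outcome we want to predict depends only on ability.
outcome = ability + rng.normal(0.0, 0.1, size=n)

# Ordinary least squares on [score, group, intercept].
X = np.column_stack([score, group, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
score_coef, group_coef, intercept = coef
print(f"score={score_coef:.3f}, group={group_coef:.3f}, intercept={intercept:.3f}")
```

In this setup the fitted coefficient on `group` comes out close to `-bias`, which is the regression subtracting the inflation back out. Note that it relies on the group variable being in the data and on the bias being additive.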

I don't really understand that claim. You are explicitly adding a bias that is linearly dependent on your race variable, and then allowing your regression to recover that bias by introducing noisy measurements of race (which you as the modeller knew was the thing causing the bias). As you say, that is unsurprising.

That result does not, however, address my point, which is that if the structure of the bias is difficult to understand, or perhaps even just difficult to model, and if relevant measurements (with errors that are uncorrelated with your original errors) are unavailable, then bias correction is essentially impossible.
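As a toy illustration of the "difficult to model" case, the sketch below assumes a hypothetical bias that multiplies group 1's scores by 2 instead of adding a constant. An additive linear model is then misspecified, while a model with a hand-added `score * group` interaction term matches the bias structure. All names and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

ability = rng.normal(0.0, 1.0, size=n)
group = rng.integers(0, 2, size=n).astype(float)

# Hypothetical multiplicative bias: group 1's scores are doubled.
score = ability * (1.0 + group)
outcome = ability + rng.normal(0.0, 0.1, size=n)


def fit_rmse(X, y):
    """Fit least squares and return the in-sample root-mean-square error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sqrt(np.mean((y - X @ coef) ** 2)))


ones = np.ones(n)
additive = np.column_stack([score, group, ones])
interaction = np.column_stack([score, group, score * group, ones])

rmse_additive = fit_rmse(additive, outcome)
rmse_interaction = fit_rmse(interaction, outcome)
print(f"additive={rmse_additive:.3f}, interaction={rmse_interaction:.3f}")
```

The additive model has the group variable available and still leaves a large error, because the form of the bias was not in the model until someone put it there.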

The point is that the bias is linear, and my model is linear, so the model fixes things. The example is synthetic (so we could know what the right answer is and check if we recover it) so of course I put everything in.

In the linked article, I explicitly reference a real-world case where the same linear model was used to discover that grades and test scores are biased in favor of blacks: http://ftp.iza.org/dp8733.pdf

In more complicated situations, the bias would need to be amenable to detection by a neural network, an SVM or random forest. The entire purpose of models like this is that lots of hidden patterns are detected.

Even if relevant measurements are unavailable, one can use redundant encoding to fix bias. Delip Rao explains redundant encoding here, for example, though he is more concerned that ML models might learn facts he wants to remain hidden: http://deliprao.com/archives/129

To remain with the example in your blog post, your model fixed things because the implicit bias model was correct (linear dependence on race), and the data were available, either directly (via the race variable) in the "What if measurements are biased?" section, or indirectly (via the noisy redundantly-encoded race variables) in the "What if we scrub race, but redundantly encode it?" section.

In the first of those two sections you yourself note how bias correction is not possible without the relevant data: "If we scrubbed the data this result would be impossible. Running least squares on scrubbed data yields alpha = [ 0.29878373, 0.30869833] - we can't correct for bias because we don't know the variable being biased on."
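The three situations (scrubbed, redundantly encoded, and directly available) can be compared side by side in a small sketch. This is not the blog post's code and will not reproduce its numbers; the five noisy proxies, their noise level, and the bias size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
bias = 1.0  # hypothetical additive bias in group 1's scores

ability = rng.normal(0.0, 1.0, size=n)
group = rng.integers(0, 2, size=n).astype(float)
score = ability + bias * group
outcome = ability + rng.normal(0.0, 0.1, size=n)

# Five hypothetical noisy proxies that redundantly encode the group.
proxies = group[:, None] + rng.normal(0.0, 1.0, size=(n, 5))


def residual_gap(features):
    """Fit least squares, then return the absolute difference between
    the mean residuals of group 1 and group 0."""
    X = np.column_stack([features, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    resid = outcome - X @ coef
    return float(abs(resid[group == 1].mean() - resid[group == 0].mean()))


gap_scrubbed = residual_gap(score)                             # group removed
gap_proxies = residual_gap(np.column_stack([score, proxies]))  # noisy encodings
gap_full = residual_gap(np.column_stack([score, group]))       # group available
print(f"scrubbed={gap_scrubbed:.3f}, proxies={gap_proxies:.3f}, full={gap_full:.3f}")
```

With these settings the proxies shrink the between-group gap but do not remove it, and the scrubbed model leaves the largest gap. How much the proxies help depends on how many there are and how noisy they are.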

I'm not disputing that bias correction is possible, only that it can be much harder than you seem to be implying, with statements like "Most algorithms can and will correct for biases in their inputs.", and "Of course data contains biases. But again, please read the article I linked; algorithms will have a tendency to correct that bias."

I have some experience with bias correction in (ocean) weather forecasting, and in that domain there were problems both with the difficulty of modelling the bias structure, and with obtaining measurements reliable enough for bias correction.

What I'm disputing is this:

> Machine learning does not have less bias than human researchers. It is simply magnified at scale.

This is fundamentally wrong. Given data on the biasing factor, most algorithms will try to use it and improve things. Sometimes information is unavailable. On net there is a reason why many algorithms will reduce bias, and no particular reason why they would increase it equally in the remaining cases.

> Given data on the biasing factor, most algorithms will try to use it and improve things.

Unless the bias is in what they are designed to optimize for (either because the goal is explicitly biased or because the operationalization of the goal into a concrete measure is, whether intentionally or not, biased), in which case they will obviously reinforce it.
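A minimal sketch of that failure mode, assuming a hypothetical setup where the feature is unbiased but the label the model is trained on carries a penalty against group 1. The names and the penalty size are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
penalty = 1.0  # hypothetical penalty baked into group 1's labels

ability = rng.normal(0.0, 1.0, size=n)
group = rng.integers(0, 2, size=n).astype(float)

# The feature is an unbiased (if noisy) measurement of ability.
score = ability + rng.normal(0.0, 0.1, size=n)
# The training target itself is biased against group 1.
label = ability - penalty * group + rng.normal(0.0, 0.1, size=n)

X = np.column_stack([score, group, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, label, rcond=None)
group_coef = float(coef[1])
print(f"learned group coefficient: {group_coef:.3f}")
```

The fit is doing exactly what it was asked to do: the learned group coefficient lands near `-penalty`, so the model reproduces the bias in the target instead of correcting it.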