by Pengy7 2796 days ago
> If so, even if you don't include gender as a feature itself, your outputs may end up being biased (in the technical sense) by gender.

Part of the problem is that people use the same word to mean several different things. For instance, "bias" has a precise mathematical definition in statistics: https://en.wikipedia.org/wiki/Bias_(statistics) . The quoted sentence makes no sense under that definition. In fact, with linear models it is mathematically impossible to make a "worse" model (in terms of mean squared error) by including more variables (like gender, age, race, etc...).

> you can't be "blind" to race, color, religion, and gender

Also, I am not sure that this train of thought actually leads where we want to go. A perfect model isn't necessarily blind to these features; a perfect model treats everyone as an individual.

1 comments

> In fact, with linear models it is mathematically impossible to make a "worse" model (in terms of mean squared error) by including more variables (like gender, age, race, etc...).

That's only true if you mean the mean squared error on the training data, which is usually not a good indicator of model quality. Instead you should use the mean squared error on held-out test data, which tends to get worse when you add non-predictive variables to the input.

If there are non-predictive variables, the linear model with the lowest expected squared error should assign exactly zero weight to them, which is equivalent to the situation where those variables don't exist. But when training on a finite sample, that "exactly zero" outcome is extremely unlikely (as in, the probability is 0) if the non-predictive variables vary at all. That variation lets the model pick out individual data points, even though the relationship is completely random and doesn't help it generalize to unseen data. In other words, the model overfits to noise.
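
Here is a rough sketch of that effect, in case it helps. Everything in it is made up for illustration: the data is synthetic, the sizes (50 training rows, 40 noise columns) and the helper names `fit_and_score` and `make_data` are my own choices, and it uses plain least squares from NumPy. It fits one model on the single predictive variable and another on that variable plus the unrelated noise columns, then compares training and test error.

```python
import numpy as np

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares (with intercept); return (train MSE, test MSE)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((A_train @ coef - y_train) ** 2)
    test_mse = np.mean((A_test @ coef - y_test) ** 2)
    return train_mse, test_mse

rng = np.random.default_rng(0)
n_train, n_test, n_noise = 50, 10_000, 40  # illustrative sizes, not from the thread

def make_data(n):
    x = rng.normal(size=(n, 1))           # the one predictive variable
    z = rng.normal(size=(n, n_noise))     # non-predictive variables, unrelated to y
    y = 2.0 * x[:, 0] + rng.normal(size=n)
    return x, z, y

x_tr, z_tr, y_tr = make_data(n_train)
x_te, z_te, y_te = make_data(n_test)

# Model 1: only the predictive variable.
train_small, test_small = fit_and_score(x_tr, y_tr, x_te, y_te)

# Model 2: the predictive variable plus the noise columns.
train_big, test_big = fit_and_score(
    np.hstack([x_tr, z_tr]), y_tr, np.hstack([x_te, z_te]), y_te
)

print(f"train MSE: {train_small:.3f} (1 var) vs {train_big:.3f} (1 + {n_noise} vars)")
print(f"test  MSE: {test_small:.3f} (1 var) vs {test_big:.3f} (1 + {n_noise} vars)")
```

The training error of the larger model can never be higher, because least squares could always reproduce the smaller model by setting the extra weights to zero. With this few training rows relative to the number of noise columns, the test error of the larger model should come out clearly worse, which is the overfitting described above.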