| HN Mirror


	by gwern 2832 days ago
	> There is no trade-off. Note that the accuracy of sentiment prediction went up when we switched to ConceptNet Numberbatch. Some people expect that fighting algorithmic racism is going to come with some sort of trade-off. There’s no trade-off here. You can have data that’s better and less racist. You can have data that’s better because it’s less racist. There was never anything “accurate” about the overt racism that word2vec and GloVe learned.

	The big conclusion here after all that code buildup does not logically follow. All it shows is that one new word embedding, trained by completely different people for different purposes with different methods on different data using much fancier semantic structures, outperforms (by a small and likely non-statistically-significant degree) an older word embedding (which is not even the best such word embedding from its batch, apparently, given the choice to not use 840B). It is entirely possible that the new word embedding, trained the same minus the anti-bias tweaks, would have had still superior results.

2 comments

ma2rten 2832 days ago

I also disagree with the conclusion, but for a different reason. I think it's unlikely that the word embeddings were just lower quality. That should result in noise, not bias.

I think that there is a real statistical pattern in the training data: names associated with certain ethnicities are more likely to appear close to words with negative sentiment. I just don't think this necessarily means that the news is racist. I think more analysis is needed to see where this pattern comes from.

However, if it is true that the news is biased and racist in a quantifiable way, that would be a bigger problem than biased word vectors. I would genuinely be interested in seeing that type of analysis.

bo1024 2831 days ago

Note though that "the news is racist" is different from "the model we learned (from the news) is racist". Maybe the first can be false while the second is true.

skybrian 2832 days ago

I think you're reading this statement as more general than it's meant to be? I interpret it as meaning that there is not necessarily any tradeoff, as there wasn't in this case. "You can have data" -> there exists.

gwern 2832 days ago

> I interpret it as meaning that there is not necessarily any tradeoff, as there wasn't in this case.

They haven't shown that there is no tradeoff, either in general or in this case.

guywhocodes 2832 days ago

Is there anyone who thinks that the current level of racism is required for the current accuracy? I can't imagine people that racist to be common in the data community

AnthonyMouse 2832 days ago

> Is there anyone who thinks that the current level of racism is required for the current accuracy? I can't imagine people that racist to be common in the data community

It depends on two things. The first is how you're defining racism. If the algorithm is predicting that 10% of white people and 30% of black people will do X, because that is what actually happens, some people will still call that racism but there is no possible way to change it without reducing accuracy.

If the algorithm is predicting that 8% of white people and 35% of black people will do X even though the actual numbers are 10% and 30%, then the algorithm has a racial bias and it is possible to both reduce racism and increase accuracy. But it's also still possible to do the opposite.

One way to get the algorithm to predict closer to 10% and 30% is to get better data, e.g. take into account more factors that represent the actual cause of the disparity and just happen to correlate with race, so factoring them out reduces the bias and improves accuracy in general.

The other way is to anchor a pivot on race and push on it until you get the results you want, which will significantly harm accuracy in various subtle and not so subtle ways all over the spectrum because what you're really doing is fudging the numbers.
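
To put rough numbers on the 8%/35% versus 10%/30% example, here is a minimal sketch. It assumes two equally sized groups and uses a plain average of absolute errors; the "equalized" predictions are only a crude stand-in for pushing on a race pivot until the gap disappears, not a claim about how any real system does it.

```python
# Rates taken from the example in the comment above (hypothetical behaviour "X").
actual = {"white": 0.10, "black": 0.30}
biased = {"white": 0.08, "black": 0.35}

# Crude stand-in for "fudging the numbers": force both groups to the same
# prediction (here, the average of the actual rates). This is an assumption
# made for illustration only.
equalized = {g: sum(actual.values()) / len(actual) for g in actual}


def mean_abs_error(predicted, truth):
    """Average absolute gap between predicted and true group rates
    (assumes the groups are the same size)."""
    return sum(abs(predicted[g] - truth[g]) for g in truth) / len(truth)


def disparity(rates):
    """Difference between the two group rates."""
    return rates["black"] - rates["white"]


for name, pred in [("biased", biased), ("actual", actual), ("equalized", equalized)]:
    print(name, round(disparity(pred), 3), round(mean_abs_error(pred, actual), 3))
```

Under these assumptions the biased model overstates the gap (0.27 against a true 0.20) and has nonzero error, so moving it toward the actual rates shrinks the exaggeration and the error together. Forcing the gap to zero removes the disparity but leaves a larger error than the biased model had.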

nnnnnande 2831 days ago

"If the algorithm is predicting that 10% of white people and 30% of black people will do X, because that is what actually happens, some people will still call that racism but there is no possible way to change it without reducing accuracy."

What is actually happening? Does it tell you whether they are doing X precisely because they are black or white? The racist part might not be the numbers per se, but the conclusion that the color of their skin has anything to do with their respective choices.

edit: spelling

TeMPOraL 2831 days ago

ML is spitting out correlations, not an explicit causal model. If, in reality, X is only indirectly and accidentally correlated with race, but I look at the ML result and conclude the skin color has something to do with X, then the only racist element in the whole system is me.
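
A toy illustration of "indirectly and accidentally correlated", with entirely made-up counts: groups `A` and `B` and a hidden factor `z` are hypothetical. X depends only on `z`, but `z` is more common in group B, so the per-group rates differ even though group membership has no effect once `z` is known.

```python
# Invented counts: (group, z) -> (number of people, number who do X).
# By construction, half of the people with z == 1 do X and nobody with
# z == 0 does, regardless of group.
counts = {
    ("A", 1): (200, 100),
    ("A", 0): (800, 0),
    ("B", 1): (600, 300),
    ("B", 0): (400, 0),
}


def rate_by_group(table):
    """Rate of X per group, ignoring the factor z."""
    totals = {}
    for (group, _z), (people, did_x) in table.items():
        people_sum, did_x_sum = totals.get(group, (0, 0))
        totals[group] = (people_sum + people, did_x_sum + did_x)
    return {g: did_x / people for g, (people, did_x) in totals.items()}


def rate_by_group_and_factor(table):
    """Rate of X per (group, z) cell."""
    return {key: did_x / people for key, (people, did_x) in table.items()}


print(rate_by_group(counts))
print(rate_by_group_and_factor(counts))
```

Looking only at the group, the rates come out at 10% and 30%. Within each value of `z` the two groups are identical, which is the sense in which the correlation with group carries no causal story of its own.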

nnnnnande 2831 days ago

Agreed. That was the point I was trying to get at, albeit I might not have phrased it as clearly.
