by gwern
2832 days ago
> There is no trade-off. Note that the accuracy of sentiment prediction went up when we switched to ConceptNet Numberbatch. Some people expect that fighting algorithmic racism is going to come with some sort of trade-off. There’s no trade-off here. You can have data that’s better and less racist. You can have data that’s better because it’s less racist. There was never anything “accurate” about the overt racism that word2vec and GloVe learned.

The big conclusion here after all that code buildup does not logically follow. All it shows is that one new word embedding, trained by completely different people for different purposes with different methods on different data using much fancier semantic structures, outperforms (by a small and likely non-statistically-significant degree) an older word embedding (which is not even the best such word embedding from its batch, apparently, given the choice to not use 840B). It is entirely possible that the new word embedding, trained the same minus the anti-bias tweaks, would have had still superior results.
I agree that there is a real statistical pattern in the training data: names associated with certain ethnicities are more likely to appear close to words with negative sentiment. I just don't think this necessarily means that the news is racist. I think more analysis is needed to see where this pattern comes from.
However, if it is true that the news is biased and racist in a quantifiable way, that would be a bigger problem than biased word vectors. I would genuinely be interested in seeing that type of analysis.