by rossdavidh 2832 days ago
I think I would agree. You otherwise run the risk of having fixed the metric ("Italian" vs. "Mexican", "Chad" vs. "Shaniqua", etc.) without actually fixing the underlying issue.

Also, regarding black/white etc., there might legitimately be words with so many different meanings (whether race-related or not) that you should just exclude them from sentiment analysis. "Right" can mean an entitlement ("human rights"), correctness ("the right thing to do"), or a direction ("not left"). There are probably plenty of other words like that. You might do better to keep a list of 100-200 words that are simply excluded because of issues like that.
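
To make the exclusion-list idea concrete, here is a minimal sketch, assuming a simple lexicon-based scorer. The lexicon, its scores, and the excluded words are all made up for illustration; a real list would have to be curated by hand.

```python
# Hypothetical toy lexicon: the words and scores are illustrative only.
LEXICON = {
    "good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0,
    "right": 0.5, "white": 0.2, "black": -0.2,
}

# Hypothetical exclusion list of words judged too ambiguous to score.
EXCLUDED = {"right", "left", "white", "black"}


def sentiment(text, lexicon=LEXICON, excluded=EXCLUDED):
    """Mean lexicon score of the tokens in `text`, skipping excluded words."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    scores = [lexicon[t] for t in tokens if t in lexicon and t not in excluded]
    # A sentence with no scorable words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


with_list = sentiment("Turn right, the food is great")
without_list = sentiment("Turn right, the food is great", excluded=set())
```

With the list in place, only "great" contributes to the score; without it, the directional "right" pulls the average toward whatever score the lexicon happened to assign it.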


> there might legitimately be words which have so many different meanings

I haven't studied word embeddings past the pop-sci level, but wouldn't such words form multiple clusters in the embedding space? I would have thought it would be relatively easy to get different 'words' for 'right (entitlement)', 'right (direction)', etc.

Edit: Nibling post answers this question.

Would it be worth trying to think of words with different meanings as entirely new words? So, "white" in one sentence may be a different word than "white" in another?

There's a long list of papers on that - 'multi-sense word embeddings'. But more recently it has been found that passing character-based word representations through a two-layer BiLSTM resolves the ambiguity of meaning from context - 'ELMo'.

https://arxiv.org/abs/1802.05365 (state of the art)
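
For anyone wondering what "treating senses as new words" could look like, here is a toy sketch. To be clear, this is not how the multi-sense embedding papers or ELMo do it: it just groups occurrences of a word by shared context words and relabels them, and every name in it (`label_senses`, the `word#k` labels, the stopword list) is made up for illustration.

```python
# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "to", "is", "of", "at", "and", "on", "in", "for"}


def context(tokens, target):
    """Words sharing a sentence with the target, minus stopwords."""
    return {t for t in tokens if t != target and t not in STOPWORDS}


def label_senses(sentences, target, min_overlap=1):
    """Greedily group occurrences of `target` by shared context words,
    then rewrite each occurrence as `target#k` for its group k."""
    clusters = []  # one set of context words per induced sense
    labeled = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if target not in tokens:
            labeled.append(tokens)
            continue
        ctx = context(tokens, target)
        for k, vocab in enumerate(clusters):
            if len(ctx & vocab) >= min_overlap:
                vocab |= ctx  # grow the matched sense's vocabulary
                break
        else:
            # No existing sense shares enough context: start a new one.
            clusters.append(set(ctx))
            k = len(clusters) - 1
        labeled.append([f"{target}#{k}" if t == target else t for t in tokens])
    return labeled


sentences = [
    "turn right at the corner",
    "every citizen has the right to vote",
    "turn right after the bridge",
    "the right to vote matters",
]
labeled = label_senses(sentences, "right")
```

On these four sentences the directional uses end up as `right#0` and the entitlement uses as `right#1`, so a downstream model would see them as two different tokens. Contextual models like ELMo skip the relabeling step and compute a separate vector for each occurrence instead.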