by mci 3112 days ago
Sounds like a fun project. However, I doubt that word vectors buy you anything more than, say, good old Nilsimsa from 2001 (https://en.wikipedia.org/wiki/Nilsimsa_Hash). Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8 bytes. As it stands now, the similarity of any two texts in the same language using a non-Latin script is ~80 rather than ~0.
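To illustrate the side note about bytes versus code points, here is a minimal Python sketch. It is not py-nilsimsa's code and it does not compute a Nilsimsa digest; `unit_overlap` and the two sample words are made up for illustration. It only shows that two Cyrillic words with no letters in common still share a UTF-8 lead byte, which is presumably the kind of shared structure that inflates a byte-level similarity score.

```python
def unit_overlap(a, b):
    """Jaccard overlap between the sets of distinct units in a and b."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical sample words: no Cyrillic letter appears in both.
word_a = "мама"
word_b = "поле"

# Iterating over bytes: each of these letters is two bytes, and all of
# them happen to start with the same lead byte (0xD0).
byte_overlap = unit_overlap(word_a.encode("utf-8"), word_b.encode("utf-8"))

# Iterating over a Python str yields code points, which do not overlap here.
code_point_overlap = unit_overlap(word_a, word_b)

print(f"bytes: {byte_overlap:.2f}, code points: {code_point_overlap:.2f}")
```

With code points the overlap is zero, as you would expect for unrelated words; with bytes it is not, because the lead byte is common to both.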

word2vec has the advantage that you could potentially identify spam messages that are paraphrases rather than exact copies of the ones in the training set.
1. Pedantically: it's GloVe, not word2vec. 2. Nilsimsa, or any locality-sensitive hash, detects changed messages too, be the changes synonyms or not. 3. I don't think OP's GloVe contains words like v1agra.
We don't have words like v1agra. As I mentioned in the README, we took vectors pretrained on Wikipedia. One possible improvement would be to train the vectors on our own dataset.
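A rough sketch of what the missing-word problem looks like in practice. Everything here is hypothetical: `TOY_VECTORS` stands in for a pretrained vocabulary with invented numbers, and `split_known_unknown` is just an illustrative helper, not anything from the project's code.

```python
# Toy stand-in for pretrained vectors; the words and numbers are invented.
TOY_VECTORS = {
    "buy": [0.1, 0.3],
    "cheap": [0.4, 0.2],
    "viagra": [0.9, 0.8],
}


def split_known_unknown(message, vectors):
    """Split a message's tokens into those with a vector and those without."""
    tokens = message.lower().split()
    known = [t for t in tokens if t in vectors]
    unknown = [t for t in tokens if t not in vectors]
    return known, unknown


known, unknown = split_known_unknown("Buy cheap v1agra", TOY_VECTORS)
print("known:", known)
print("unknown:", unknown)
```

The obfuscated token has no vector, so it contributes nothing unless the vectors are trained on data where such spellings actually occur.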