| HN Mirror



	by mci 3112 days ago
	Sounds like a fun project. However, I doubt word vectors buy you anything more than, say, good old Nilsimsa from 2001 (https://en.wikipedia.org/wiki/Nilsimsa_Hash). Side note: py-nilsimsa should iterate over Unicode code points instead of UTF-8 bytes. As it stands now, the similarity of any two texts in the same language written in a non-Latin script comes out at ~80 rather than ~0.
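
	A rough illustration of the bytes-versus-code-points point, as a sketch only: this is not Nilsimsa and does not call py-nilsimsa. It uses plain trigram sets with Jaccard overlap as a simplified stand-in, and the helper names and the two sample words are made up for the example. In UTF-8 every Cyrillic letter starts with the lead byte 0xD0 or 0xD1, so byte-level trigrams of unrelated words overlap far more easily than code-point trigrams do.

```python
def trigrams(seq):
    """Return the set of consecutive 3-item windows in seq (str or bytes)."""
    return {tuple(seq[i:i + 3]) for i in range(len(seq) - 2)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def byte_similarity(s, t):
    # Windows slide over UTF-8 bytes, so shared lead bytes leak into the features.
    return jaccard(trigrams(s.encode("utf-8")), trigrams(t.encode("utf-8")))


def codepoint_similarity(s, t):
    # Windows slide over Unicode code points (Python str items).
    return jaccard(trigrams(s), trigrams(t))


# Two Russian words with no letter trigram in common (sample data for this sketch).
a, b = "молоко", "корова"

# Each letter is two bytes, and the first byte is always 0xD0 or 0xD1.
lead_bytes = [x for x in a.encode("utf-8") if x in (0xD0, 0xD1)]
print(len(lead_bytes), len(a))

print(codepoint_similarity(a, b))  # 0.0: no shared letter trigrams
print(byte_similarity(a, b))       # greater than 0: byte windows overlap anyway
```

	The real hash accumulates trigrams from a wider sliding window, so the exact numbers differ, but the inflation comes from the same source.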


laretluval 3112 days ago

word2vec has the advantage that you could potentially identify spam messages that are paraphrases rather than exact copies of the ones in the training set.


mci 3112 days ago

1. Pedantically: it's GloVe, not word2vec. 2. Nilsimsa, like any locality-sensitive hash, detects changed messages too, whether or not the changes are synonyms. 3. I don't think OP's GloVe vocabulary contains words like v1agra.
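
To make points 2 and 3 concrete, here is a small sketch under loud assumptions: `vocab` is a hypothetical three-word stand-in for a pretrained vocabulary (not OP's actual GloVe vectors), and character-trigram Jaccard overlap is used as a simple proxy for what a trigram-based hash such as Nilsimsa reacts to, not as the hash itself.

```python
def char_trigrams(word):
    """Return the set of 3-character substrings of word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def overlap(a, b):
    """Jaccard overlap of the character trigrams of two words."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical vocabulary standing in for vectors pretrained on clean text.
vocab = {"viagra", "cheap", "buy"}

obfuscated = "v1agra"

# A vocabulary lookup finds nothing for the obfuscated spelling...
print(obfuscated in vocab)

# ...while the character-level features still partly match the original word.
print(overlap("viagra", obfuscated))
```

The lookup fails outright, whereas the trigram overlap stays above zero because "agr" and "gra" survive the substitution.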


doody_parizada 3112 days ago

We don't have words like v1agra. As I mentioned in the README, we used vectors pretrained on Wikipedia. One possible improvement would be to train the vectors on our own dataset.
