by visarga
3162 days ago
Run Doc2Vec (or Word2Vec) on a large corpus of text, or download pretrained vectors. To compute a document vector, take a linear combination of the word vectors in the document, weighted by TF-IDF. Once you have a vector for each document, build a fast index with a library called "Annoy". It does very fast approximate nearest-neighbor search in vector space over millions of documents. I think this approach is faster than grep, and it doesn't need stemming. Because related phrases like "machine learning" and "neural nets" end up close together in vector space, it gives you a kind of fuzzy search.
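A rough sketch of the idea, to make the steps concrete. Everything here is an assumption for illustration: the 3-dimensional word vectors are made up by hand (real ones would come from Word2Vec training or a pretrained download), the helper names are hypothetical, the IDF formula is one common smoothed variant, and a brute-force cosine scan stands in for the Annoy index so the example needs only the standard library.

```python
import math
from collections import Counter

# Toy word vectors, invented for this example. In practice these would be
# learned by Word2Vec or loaded from pretrained vectors.
WORD_VECS = {
    "machine":  [0.90, 0.10, 0.00],
    "learning": [0.80, 0.20, 0.00],
    "neural":   [0.85, 0.15, 0.00],
    "nets":     [0.80, 0.10, 0.10],
    "cooking":  [0.00, 0.10, 0.90],
    "pasta":    [0.00, 0.20, 0.80],
    "recipes":  [0.10, 0.10, 0.90],
}

DOCS = ["machine learning", "cooking pasta recipes", "neural nets"]


def tokenize(text):
    """Lowercase, split on whitespace, keep only words we have vectors for."""
    return [w for w in text.lower().split() if w in WORD_VECS]


def idf_table(docs):
    """Smoothed inverse document frequency for every known word."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(tokenize(d)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def doc_vector(text, idf):
    """TF-IDF-weighted sum of word vectors, scaled to unit length."""
    dim = len(next(iter(WORD_VECS.values())))
    vec = [0.0] * dim
    for word, count in Counter(tokenize(text)).items():
        weight = count * idf.get(word, 1.0)
        for i, x in enumerate(WORD_VECS[word]):
            vec[i] += weight * x
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def search(query, docs, idf, k=2):
    """Brute-force cosine search; an Annoy index would replace this scan."""
    q = doc_vector(query, idf)
    scored = [(sum(a * b for a, b in zip(q, doc_vector(d, idf))), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]


IDF = idf_table(DOCS)
results = search("machine learning", DOCS, IDF)
print(results)
```

With these toy vectors the query "machine learning" ranks the "neural nets" document well above the cooking one even though they share no words, which is the fuzzy-search effect. At real scale the linear scan in search() would be swapped for an approximate nearest-neighbor index such as Annoy.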