Hacker News
by visarga 3162 days ago
Run Doc2Vec (or Word2Vec) on a large corpus of text, or download pretrained vectors. To compute a document vector, take a linear combination of the vectors of the words in the document, weighted by TF-IDF. Once you have a vector for each document, build a fast index with a library called "Annoy", which can do very fast similarity search in vector space over millions of documents. I think this approach works faster than grep and doesn't need stemming. It will automatically know that "machine learning" and "neural nets" are related, so it does a kind of fuzzy search.
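To make the recipe above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the 3-dimensional `WORD_VECTORS` are made up (real ones would come from Word2Vec or a pretrained download), the IDF smoothing is one common variant, and the brute-force scan in `most_similar` stands in for the approximate index that a library like Annoy would provide.

```python
import math
from collections import Counter

# Toy word vectors, invented for this example only.
WORD_VECTORS = {
    "machine": [0.9, 0.1, 0.0],
    "learning": [0.8, 0.2, 0.0],
    "neural": [0.85, 0.15, 0.05],
    "nets": [0.8, 0.1, 0.1],
    "cooking": [0.0, 0.9, 0.1],
    "pasta": [0.05, 0.95, 0.0],
    "the": [0.3, 0.3, 0.3],
}

def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def doc_vector(tokens, idf, vectors=WORD_VECTORS):
    """TF-IDF weighted average of the vectors of the known tokens."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in Counter(tokens).items():
        if tok not in vectors:
            continue  # out-of-vocabulary tokens are skipped
        w = tf * idf.get(tok, 1.0)  # unseen tokens get a neutral weight
        total = [t + w * v for t, v in zip(total, vectors[tok])]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(query_tokens, docs, idf):
    """Brute-force scan; an approximate index would replace this loop at scale."""
    q = doc_vector(query_tokens, idf)
    scored = [(cosine(q, doc_vector(d, idf)), i) for i, d in enumerate(docs)]
    return max(scored)[1]

DOCS = [
    ["the", "neural", "nets"],
    ["the", "cooking", "pasta"],
]
IDF = idf_table(DOCS)
best = most_similar(["machine", "learning"], DOCS, IDF)  # index of the closest document
```

With these toy vectors the query "machine learning" lands on the "neural nets" document even though they share no keyword, which is the fuzzy-search effect described above.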
3 comments

If you wanted it to know that "machine learning" and "neural networks" were related, wouldn't you need to do some type of entity extraction first, since Word2vec is run on tokens?
You can use Gensim:

    from gensim.models.phrases import Phrases
    bigrams = Phrases(corpus)
Or you could rank bigrams by count(w1 w2)^2 / (count(w1) * count(w2)), where count(w1 w2) is the number of times the two words occur next to each other.

Many variations on this formula work, but the idea is to compare the count of the bigram to the counts of the unigrams.
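A small sketch of that scoring in plain Python, in case it helps. The `min_count` cutoff and the toy sentences are my own assumptions, not part of the formula: without a cutoff, any pair of words that each occur exactly once, side by side, would also score a perfect 1.0.

```python
from collections import Counter

def score_bigrams(sentences, min_count=2):
    """Rank bigrams by count(w1 w2)^2 / (count(w1) * count(w2))."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))  # adjacent pairs within a sentence
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # assumed cutoff: ignore pairs seen too rarely to trust
        scores[(w1, w2)] = c ** 2 / (unigrams[w1] * unigrams[w2])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

SENTENCES = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "is", "hard"],
    ["cooking", "is", "fun"],
    ["the", "weather", "is", "fine"],
]
ranked = score_bigrams(SENTENCES)
```

Here "machine learning" comes out on top because the two words never appear apart, while pairs involving the frequent word "is" are pushed down by its large unigram count.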

By the way, you do bigram identification before training Word2Vec, so that you get dedicated vectors for the bigrams as well.

Besides this method, there is one great way to identify n-grams: use Wikipedia titles. It's quite an extensive list that covers most of the important named entities, locations and multi-word topic names. Or go directly to http://wiki.dbpedia.org/ for a huge list with millions of n-grams. Cross-reference it with your text corpus and you get a nice clean list.
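One way the cross-referencing could look, as a rough sketch: given a set of known multi-word names, greedily join the longest match at each position into a single token before training. The `TITLES` set is a hypothetical stand-in for a real title list (lowercased and tokenized), and the underscore joining and the `max_len` limit are assumptions for illustration.

```python
def merge_known_phrases(tokens, phrases, max_len=3):
    """Greedily join the longest known phrase starting at each position."""
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here: keep the single token.
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical stand-in for a list of Wikipedia titles.
TITLES = {("machine", "learning"), ("new", "york"), ("new", "york", "city")}

merged = merge_known_phrases(
    ["i", "study", "machine", "learning", "in", "new", "york", "city"],
    TITLES,
)
# merged == ["i", "study", "machine_learning", "in", "new_york_city"]
```

Trying the longest candidate first is what keeps "new york city" from being split into "new_york" plus "city".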

The original word2vec source code comes with a probabilistic phrase detection tool. Keyword: word2phrase.
Good to know, thanks!
As an alternative to tf-idf, there's an interesting property of the word embeddings generated by word2vec: they're sorted by frequency, with the most common words at the top of the list.

So if you insert them in the same order in a database, you can just use each word's primary key as its weight. This also has the advantage of filtering out stop words without any additional processing, since they get the smallest keys.
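A toy sketch of the rank-as-weight idea, under stated assumptions: the frequency order is rebuilt here from a tiny token list instead of being read from a real word2vec vocabulary file, and the `stop_rank` cutoff is an arbitrary value picked for the example.

```python
from collections import Counter

def build_rank_table(corpus_tokens):
    """Map each word to its 1-based rank in descending frequency order."""
    counts = Counter(corpus_tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def rank_weights(tokens, ranks, stop_rank=2):
    """Use the rank as the weight and drop words at or below the assumed cutoff."""
    return {t: ranks[t] for t in tokens if t in ranks and ranks[t] > stop_rank}

CORPUS = "the the the the of of of vector vector annoy".split()
RANKS = build_rank_table(CORPUS)          # the=1, of=2, vector=3, annoy=4
weights = rank_weights(["the", "vector", "annoy"], RANKS)
# weights == {"vector": 3, "annoy": 4}
```

Rare words end up with larger weights, and the most frequent words fall below the cutoff without needing a separate stop-word list.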

If I understand correctly, this forgoes Lucene entirely. I would really like something that can be integrated into Lucene/Solr due to the availability of all the infrastructure built around it.

> works faster than grep

I didn't quite get the connection to grep.

Suppose you have gigabytes of text: Annoy will find matching articles faster and more precisely than grepping with keywords.
Lucene is faster and better than grep too. Annoy may be better than Lucene's "more like this" query, which finds documents in an index that are similar to a given set of documents. But how would it help with keyword search, which is what is being asked about?
I know inverted index search is fast; it is the basic search engine algorithm. But there is a difference in the quality of the top-ranked results. With word vectors you can ensure the topic of the whole document is what you want. Many documents mix topics, and some keywords appear by mistake in the wrong place, for example because scraping web text is imperfect and might capture extra text.