by visarga 3154 days ago
You can use Gensim:

    from gensim.models.phrases import Phrases
    # corpus: an iterable of tokenized sentences (lists of strings)
    bigrams = Phrases(corpus)

Or you could rank bigrams by count(w1 w2)^2 / (count(w1) * count(w2)), where count(w1 w2) is the number of times w1 is immediately followed by w2.

Many variations on this formula work, but the idea is always to compare the count of the bigram to the counts of its component unigrams.
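
A minimal sketch of that ranking using only the standard library. The function name rank_bigrams, the min_count cutoff and the toy corpus are my own assumptions for illustration; this is not what Gensim does internally.

```python
from collections import Counter

def rank_bigrams(sentences, min_count=1):
    """Rank adjacent word pairs by count(w1 w2)^2 / (count(w1) * count(w2))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # zip pairs each token with its immediate successor
        bigrams.update(zip(tokens, tokens[1:]))
    scores = {
        pair: n * n / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
        if n >= min_count  # hypothetical cutoff to drop very rare pairs
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy corpus, made up for illustration
sentences = [
    "i moved to new york last year".split(),
    "new york is a big city".split(),
    "she bought a new car in new york".split(),
    "the city is big".split(),
]
ranked = rank_bigrams(sentences, min_count=2)
```

Without some minimum count, a pair of words that each occur only once gets the maximum score of 1, so on a real corpus you probably want a cutoff well above 1.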

By the way, you run bigram identification before training Word2Vec so that the bigrams get their own specialized vectors as well.

Besides this method, there is another great way to identify ngrams: use Wikipedia titles. It's an extensive list that covers most of the important named entities, locations and multi-word topic names. Or go directly to http://wiki.dbpedia.org/ for a huge list with millions of ngrams. Cross-reference it with your text corpus and you get a nice clean list.
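
The cross-referencing step could look roughly like this. It is only a sketch: it assumes you have already loaded the titles into a set of lowercased, space-separated strings (the tiny titles set here is a stand-in), and find_known_ngrams and max_len are names I made up.

```python
def find_known_ngrams(tokens, known_ngrams, max_len=4):
    """Return the n-grams (n >= 2) in `tokens` that appear in `known_ngrams`.

    Scans left to right and prefers the longest match at each position.
    """
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to bigrams
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known_ngrams:
                found.append(candidate)
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return found

# Stand-in for a real title list (assumed lowercased)
titles = {"new york", "new york city", "machine learning"}
tokens = "she studies machine learning in new york city".split()
matches = find_known_ngrams(tokens, titles)
```

Matching the longest candidate first keeps "new york city" as one phrase instead of stopping at "new york".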