We formulate SGNS (skip-gram with negative sampling) word2vec as a distributed graph problem: the nodes are the unique tokens in the corpus (the dictionary), and the edges are defined by skipgrams. Each skipgram (w_in, w_center) contributes an edge from w_in to w_center.
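To make the graph view concrete, here is a minimal single-machine sketch that turns a token sequence into directed skipgram edges. The function name `skipgram_edges` and the fixed symmetric `window` are assumptions made for illustration (the reference word2vec implementation shrinks the window randomly per position), and storing edges as a counter is just one possible representation.

```python
from collections import Counter

def skipgram_edges(tokens, window=2):
    """Count directed edges (w_in, w_center) for every skipgram in `tokens`.

    Assumption: a fixed symmetric context window of size `window`.
    """
    edges = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Edge goes from the context word (w_in) to the center word.
                edges[(tokens[j], center)] += 1
    return edges

edges = skipgram_edges("the quick brown fox".split(), window=1)
# The nodes are simply the unique tokens that appear in some edge.
nodes = {word for edge in edges for word in edge}
```

Repeated skipgrams show up as edge multiplicities, so a worker that iterates over edges would see frequent pairs more often.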

Tokens are randomly distributed over a set of workers. Each worker iterates over its edges in parallel with all other workers and performs the appropriate computation.
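The partitioning step might look roughly like the sketch below. It assumes that an edge is owned by the worker holding its source token w_in; the comment does not say which endpoint determines ownership, so treat that as a guess. The helper names `assign_tokens` and `edges_by_worker` are hypothetical, and the loop over workers stands in for what would run in parallel on separate machines.

```python
import random
from collections import defaultdict

def assign_tokens(vocab, num_workers, seed=0):
    """Assign each token to a worker uniformly at random."""
    rng = random.Random(seed)
    # Sort first so the assignment is reproducible for a given seed.
    return {word: rng.randrange(num_workers) for word in sorted(vocab)}

def edges_by_worker(edges, owner):
    """Group edges by the worker that owns the source token (assumption)."""
    per_worker = defaultdict(list)
    for w_in, w_center in edges:
        per_worker[owner[w_in]].append((w_in, w_center))
    return dict(per_worker)

toy_edges = [("the", "quick"), ("quick", "the"), ("quick", "brown"),
             ("brown", "quick"), ("brown", "fox"), ("fox", "brown")]
vocab = {word for edge in toy_edges for word in edge}
owner = assign_tokens(vocab, num_workers=2)
partition = edges_by_worker(toy_edges, owner)

# Each worker would process only its own edge list.
for worker_id, local_edges in partition.items():
    for w_in, w_center in local_edges:
        pass  # the SGNS update for (w_in, w_center) would go here
```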

Drawing negative samples is done in two steps. We first draw a worker W from a suitable distribution over the workers and then draw a word from W. The overall word sampling distribution is the same as in the reference implementation (i.e., the unigram distribution raised to the 3/4 power).
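One way to get that overall distribution is to pick a worker with probability proportional to the total count^(3/4) weight of the words it holds, then pick a word on that worker proportional to its own weight. The worker's total cancels, leaving count^(3/4) / Z for each word. The sketch below illustrates this under that assumption; the function names, toy counts, and toy assignment are all made up for the example, and the paper's actual worker distribution may be constructed differently.

```python
import random
from collections import defaultdict

POWER = 0.75  # exponent used by the reference word2vec implementation

def worker_tables(counts, owner):
    """Group smoothed unigram weights count**0.75 by owning worker."""
    tables = defaultdict(dict)
    for word, count in counts.items():
        tables[owner[word]][word] = count ** POWER
    return dict(tables)

def draw_negative(tables, rng):
    """Step 1: pick a worker by its total weight. Step 2: pick a word on it."""
    workers = list(tables)
    mass = [sum(tables[w].values()) for w in workers]
    worker = rng.choices(workers, weights=mass)[0]
    words = list(tables[worker])
    return rng.choices(words, weights=[tables[worker][w] for w in words])[0]

def two_step_probability(tables, word):
    """Exact probability of drawing `word` under the two-step scheme."""
    total = sum(sum(table.values()) for table in tables.values())
    for table in tables.values():
        if word in table:
            mass = sum(table.values())
            # P(worker) * P(word | worker); `mass` cancels out.
            return (mass / total) * (table[word] / mass)
    return 0.0

counts = {"the": 100, "fox": 10, "quick": 5, "brown": 1}  # toy counts
owner = {"the": 0, "fox": 1, "quick": 0, "brown": 1}      # toy assignment
tables = worker_tables(counts, owner)
sample = draw_negative(tables, random.Random(0))
```

In a real deployment step 2 would run on the chosen worker, so no single machine needs the full sampling table.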

This work will soon be made public [1].

[1] Stergios Stergiou, Zygimantas Straznickas, Rolina Wu, and Kostas Tsioutsiouliklis. "Distributed Negative Sampling for Word Embeddings." AAAI 2017.