| HN Mirror


by jknz 3745 days ago

From my understanding, they ran word2vec [1] on their email dataset. Anyone can run word2vec on any dataset on a single desktop machine. What I don't get is why word2vec is not mentioned.

Edit: the algorithm mentioned is t-SNE [2], which seems to be another algorithm for dimensionality reduction. I don't know how it compares to word2vec.

[1] for instance, https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/...

[2] https://lvdmaaten.github.io/tsne/

3 comments

mrdrozdov 3745 days ago

Although the visualization is similar to what you might see in a word2vec demo, they haven't run word2vec here. There are many ways to generate word vectors. word2vec is one, but the method used here was a Recurrent Neural Network (RNN), specifically a Long Short-Term Memory network (LSTM). Word vectors can have very high dimensionality (in this case, the dimension was 50), which makes them difficult to visualize directly. The t-SNE algorithm reduces the dimensionality to the point where you can plot the vectors and still compare different data points to some useful extent.


bglazer 3745 days ago

They didn't run word2vec. They built an LSTM-RNN (Long Short-Term Memory Recurrent Neural Network). They mention this in the caption of the image showing word clusters.

word2vec and LSTM-RNN both produce word embeddings, which are vector representations of words. They then applied t-SNE, which is a dimensionality reduction technique designed to produce nicely separated two-dimensional clusters from any high-dimensional data. It can do this for any "type" of vector, not just word embeddings.

So, word2vec and LSTM-RNN both make high-dimensional vectors out of words. t-SNE takes high-dimensional vectors and makes them two-dimensional.
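
As a rough illustration of that last step, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The random 50-dimensional vectors are only stand-ins for learned embeddings (they are not the article's data), and the helper name `project_to_2d`, the perplexity, and the seed are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_to_2d(vectors, perplexity=5.0, seed=0):
    """Reduce an (n_samples, n_features) array to (n_samples, 2) with t-SNE."""
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(vectors)

# Stand-in data: 40 "words", each a 50-dimensional vector (assumption, not real embeddings).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 50))

points = project_to_2d(embeddings)
print(points.shape)  # one (x, y) pair per input vector
```

Nothing in the call depends on where the vectors came from, which is the point above: the same function would accept word2vec output, LSTM-derived embeddings, or any other high-dimensional data.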


cjauvin 3745 days ago

word2vec is an algorithm that produces meaningful "word embeddings", which are vector representations of words in a usually high-dimensional space. t-SNE is a dimensionality-reduction algorithm. The two can be used together, as they serve different purposes.


jknz 3745 days ago

One could argue that word embeddings are also dimensionality-reduction techniques: words live in an infinite-dimensional space, and an embedding is a finite-dimensional projection of this infinite-dimensional space.
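
A small sketch of that view, with the caveat that a real vocabulary is finite, so this shows a projection from a vocabulary-sized one-hot space rather than a truly infinite one. The vocabulary, the embedding size `dim`, and the random matrix `E` are all made up for illustration. A trained model would learn `E` instead of sampling it.

```python
import random

vocab = ["email", "inbox", "reply", "meeting", "lunch"]
index = {word: i for i, word in enumerate(vocab)}
dim = 3  # embedding size, far smaller than a realistic vocabulary

random.seed(0)
# Hypothetical embedding matrix: one row of `dim` numbers per word.
E = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def one_hot(word):
    """Represent a word as a vector with one dimension per vocabulary entry."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

def project(one_hot_vec):
    """Multiply a one-hot row vector by E: a linear map from len(vocab) to dim dimensions."""
    return [sum(x * E[i][j] for i, x in enumerate(one_hot_vec)) for j in range(dim)]

dense = project(one_hot("reply"))
print(len(one_hot("reply")), "->", len(dense))
```

Multiplying a one-hot vector by `E` just selects the matching row, which is why embedding layers are usually implemented as table lookups.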


mrdrozdov 3745 days ago

I think of word vectors in the opposite light. Words stored in a dictionary have one dimension (their index), which makes comparisons between them more or less random. Word vectors augment the information you have about a word by repeatedly examining the contexts in which the word appears in a corpus of text.
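
To make the "context" idea concrete, here is a minimal sketch using plain co-occurrence counts. This is neither word2vec nor an LSTM, just the simplest count-based stand-in. The toy corpus, the window size, and the function names are assumptions for the example.

```python
from collections import Counter, defaultdict
import math

# Toy corpus (made up for illustration).
corpus = [
    "please reply to this email",
    "please reply to this message",
    "lunch meeting at noon",
    "lunch break at noon",
]

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vecs = context_vectors(corpus)
sim_close = cosine(vecs["email"], vecs["message"])  # same neighbours
sim_far = cosine(vecs["email"], vecs["noon"])       # no shared neighbours
print(sim_close, sim_far)
```

In this toy corpus "email" and "message" appear in identical contexts and score close to 1, while "email" and "noon" share no context words and score 0. A dictionary index alone could not tell you that.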


josh11b 3745 days ago

In fact, the typical way to visualize word2vec embeddings is t-SNE.
