by mrdrozdov 3745 days ago

> A RNN makes predictions based on sequential data. When a RNN is trained on sequences of words, it learns to represent each word as a high dimensional vector which encodes the model’s understanding of that word. By projecting these high dimensional vectors into a two dimensional space, it’s possible to visualize their relationships and glean insights into the concepts that the model has learned.

It sounds like what's being visualized are the probability vectors that the model outputs, which usually have one value per possible class (noun, verb, etc. in this case). If that's the case, I don't see how the t-SNE visualization is much more useful than a confusion matrix. Typically, prior to training, words are translated from dictionary indexes into word embeddings (high dimensional vectors, where the dimension is much larger than the number of classes) that let you compare them and do vector algebra like "king - man + woman ≈ queen". You can visualize the word embeddings and color code them by class after training to see if any patterns show up.

> The RNN learned all of this semantic understanding without a human ever having to code a definition of concepts like nouns, verbs, universities, cities, meetings, or social media. This is the power of deep learning algorithms.

Was this an unsupervised approach? If so, that seems a little unusual for Part of Speech (POS) tagging. I suppose the author could mean that the model was used to label Out of Vocabulary (OOV) words, i.e. words that never appeared in the training set. Labeling unseen data points is sort of the general benefit of machine learning, and I'm not sure it can be attributed solely to Deep Learning. The main benefit I've gleaned from Deep Learning is that it automates the feature engineering phase of the machine learning pipeline.

There are lots of good resources on RNNs, LSTMs, word embeddings and t-SNE out there from Stanford, NYU, Theano, TensorFlow, and the like. Here's a blog post that gives some background if you're interested: http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...

adamklec 3745 days ago

Hello. This is Adam. I trained the model and made the visualization. Thanks for your comments.

This model is not a POS tagger. The model was trained to predict the next word in the email given the preceding words. So in that sense, it's similar to the word2vec models discussed in the link you shared. However for this work I used a recurrent neural network to learn a language model of the emails in our database.
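As a rough illustration of what "predict the next word given the preceding words" means as a training setup, here is a minimal sketch that turns one tokenized sentence into (preceding words, next word) pairs. The function name `next_word_pairs` and the sample sentence are made up for illustration; this is not the actual preprocessing used for the model.

```python
def next_word_pairs(tokens):
    """Return (preceding_words, next_word) pairs for a token sequence."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i is the context; tokens[i] is the target.
        pairs.append((tokens[:i], tokens[i]))
    return pairs


# Hypothetical example sentence, whitespace-tokenized.
email = "can we meet on tuesday at noon".split()
pairs = next_word_pairs(email)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

An RNN consumes the context one word at a time and carries a hidden state forward, so in practice the pairs are implicit in the sequence rather than materialized like this.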

After training, I extracted the learned word vectors from the model (they are the weights that connect the input layer which uses a one-hot-encoding of vocab words to the embedding layer). I then used the t-SNE algorithm to reduce the dimensionality of the learned word vectors and then plotted them in 2 dimensions. The colors representing the parts of speech were added after the fact to show that the model had learned to distinguish between nouns, verbs, etc.
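The projection step can be sketched roughly as follows, assuming NumPy and scikit-learn are installed. The random matrix is only a stand-in for the learned embedding weights (one row per vocabulary word), and the sizes and perplexity are arbitrary choices for illustration, not the settings used for the actual visualization.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned embedding matrix: 30 "words", 16 dimensions each.
# In the real setting these rows would be read out of the trained model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 16))

# Reduce to 2 dimensions. Perplexity must be smaller than the number of rows.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points = tsne.fit_transform(embeddings)

# Each row of `points` is an (x, y) position that can be scatter-plotted
# and colored afterwards by a part-of-speech label.
print(points.shape)
```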

mrdrozdov 3745 days ago

Thanks Adam! It's nice work and it seems like there's a pretty epic dataset to analyze at x.ai. My main confusion was what the visualized vectors represented, but I guess you've answered that by saying they're the first layer in your model (if I'm interpreting correctly). What I don't quite understand is how you got the word vector from the inputs. It sounds like you represent each word as a one hot encoding (similar to indexes), and then you pass this one hot encoding through the first layer giving you the word vector for each input?

adamklec 3745 days ago

That's right. The weights that connect the Nth neuron in the one-hot input layer to the embedding layer can be thought of as a vector encoding of the Nth word in the vocabulary.
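To make that concrete, here is a small pure-Python sketch showing that multiplying a one-hot vector by the input-to-embedding weight matrix just selects one row of that matrix. The vocabulary and the weight values are invented for illustration.

```python
# Hypothetical 4-word vocabulary and a 4 x 3 weight matrix
# (one row per vocabulary word, one column per embedding dimension).
vocab = ["meeting", "tuesday", "schedule", "lunch"]
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(x, weights):
    """Vector-matrix product x @ weights, written out explicitly."""
    dim = len(weights[0])
    return [sum(x[n] * weights[n][d] for n in range(len(x))) for d in range(dim)]


n = vocab.index("schedule")
vector = embed(one_hot(n, len(vocab)), W)

# The product equals row n of W, i.e. the word vector of the Nth word.
print(vector == W[n])
```

This is why embedding layers are usually implemented as a table lookup rather than an actual matrix multiplication.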

jdonaldson 3745 days ago

How does the recurrent neural network technique compare to the CBOW technique in word2vec? CBOW would've been the first thing I tried.

adamklec 3745 days ago

I agree that's an interesting comparison to make but I'm not sure of the answer. The original purpose of this work was not to generate word vectors but rather to evaluate whether we have enough data to start using deep learning algorithms. That an RNN trained on our data was able to learn word vectors with a significant amount of structure seems like a positive sign. But I don't know how the quality of these word vectors would compare to vectors generated by more standard word2vec algorithms.

nicklo 3745 days ago

There are tons of ways to evaluate word vector quality! Word analogy tasks, word similarity tasks, contextual prediction tasks, etc.

This link contains a bunch of relevant evaluation datasets and benchmarks obtained using word2vec, GloVe, etc. You can evaluate your RNN-learned vectors and compare them to traditionally trained word2vec vectors. Link here: http://www.bigdatalab.ac.cn/benchmark/bm/Domain?domain=Word%...

For more background on evaluating word vectors check out these pretty great lecture notes from Socher's NLP class: http://cs224d.stanford.edu/lecture_notes/LectureNotes2.pdf

Also, here are the original papers from a few years ago that introduced many of these datasets and evaluation standards:

https://papers.nips.cc/paper/5021-distributed-representation...

http://www.cs.cmu.edu/~mfaruqui/papers/acl14-vecdemo.pdf
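For a sense of what the word similarity and word analogy tasks mentioned above compute, here is a minimal pure-Python sketch. The 3-dimensional vectors are hand-made toy values chosen so the example works; real evaluations run over learned vectors and the benchmark datasets linked above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the nearest neighbour of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as analogy benchmarks usually do.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors, invented for illustration only.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "lunch": [-0.5, 0.2, 0.2],
}

# Analogy task: man is to king as woman is to ?
answer = analogy("man", "king", "woman", vectors)
print(answer)

# Similarity task: related words should score higher than unrelated ones.
print(cosine(vectors["king"], vectors["queen"]) > cosine(vectors["king"], vectors["lunch"]))
```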

jestinjoy1 3744 days ago

As a starting point, you could read this nicely written review to get more info about RNNs: http://arxiv.org/abs/1506.00019

alex_hirner 3745 days ago

Implementation-wise, did you train it with one of the widespread Python libs, or did you opt for one of the Scala (NLP) frameworks? If the latter, I'd be interested in which one worked for you for LSTMs (factorie is more probabilistic, mllib afaik is no good for compute graphs, d4j / sparkling water).

adamklec 3745 days ago

This work was done in Python using Theano and Keras.
