Deep Learning at x.ai (x.ai)
78 points by dfkoz 3745 days ago
4 comments

From my understanding, they ran word2vec [1] on their email dataset. Anyone can run word2vec on any dataset with a single desktop machine. What I don't get is why word2vec is not mentioned?

Edit: the mentioned algorithm is t-SNE [2] -- which seems to be another algorithm for dimension reduction. I don't know how it compares to word2vec.

[1] for instance, https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/...

[2] https://lvdmaaten.github.io/tsne/

Although the visualization is similar to what you might see from a word2vec demo, they haven't run word2vec here. There are many ways to generate word vectors; word2vec is one, but the method used here was a Recurrent Neural Network (RNN). More specifically, the type of RNN was a Long Short-Term Memory network (LSTM). Word vectors can have high dimensionality (in this case, the dimension was 50), which makes them difficult to visualize directly. The t-SNE algorithm reduces the dimensionality to the point where you can plot the vectors and still compare different data points to some useful extent.
They didn't run word2vec. They built an LSTM-RNN (Long Short-Term Memory Recurrent Neural Network). They mention this in the caption of the image showing word clusters.

word2vec and LSTM-RNN both produce word embeddings, which are vector representations of words. They then applied t-SNE, which is a dimensionality reduction technique designed to produce nicely separated 2 dimensional clusters from any high dimensional data. It can do this for any "type" of vector, not just word embeddings.

So, word2vec and LSTM-RNN both make high dimensional vectors out of words. t-SNE takes high dimensional vectors and makes them 2 dimensional.

word2vec is an algorithm to produce meaningful "word embeddings", which are vector representations in a usually high-dimensional space. t-SNE is a dimensionality-reduction algorithm. Both can be used together, as they serve different purposes.
One could argue that word embeddings are also a dimensionality-reduction technique: words live in an infinite-dimensional space, and the embedding is a finite-dimensional projection of this infinite-dimensional space.
I think of word vectors in the opposite light. Words stored in a dictionary have one dimension (their index), making comparisons more or less random. Word vectors augment the information you have about a word by continually examining the contexts in which the word appears in a corpus of text.
In fact, the typical way to visualize word2vec embeddings is t-SNE.
> A RNN makes predictions based on sequential data. When a RNN is trained on sequences of words, it learns to represent each word as a high dimensional vector which encodes the model’s understanding of that word. By projecting these high dimensional vectors into a two dimensional space, it’s possible to visualize their relationships and glean insights into the concepts that the model has learned.

It sounds like what's being visualized are the probability vectors that the model creates, which usually hold one value for each possible class (noun, verb, etc. in this case). If this is the case, I don't see how the t-SNE visualization is much more useful than a confusion matrix. Typically prior to training, words are translated from dictionary indexes into word embeddings (high dimensional vectors, where the dimension is >> the number of classes) that let you compare them and do vector algebra like "king - man + woman = queen". You can visualize the word embeddings, and color code them by class after training to see if there are any sorts of patterns in your word embeddings.
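
To make the vector algebra concrete, here is a minimal sketch of the usual analogy test (king - man + woman lands nearest queen). The three-dimensional vectors are hand-picked, hypothetical values chosen only so the arithmetic is easy to follow; real embeddings are learned and much larger, and `analogy` is an illustrative helper, not part of any library.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Axes are loosely (royalty, maleness, femaleness); real embeddings are learned.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", vectors)
print(result)
```

Excluding the three input words from the candidates is the common convention in analogy benchmarks; without it the nearest neighbor is often one of the inputs.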

> The RNN learned all of this semantic understanding without a human ever having to code a definition of concepts like nouns, verbs, universities, cities, meetings, or social media. This is the power of deep learning algorithms.

Was this an unsupervised approach? If so, that seems a little unusual for Part of Speech Tagging (POS tagging). I suppose the author could mean that the model was used to label Out of Vocabulary (OOV) words, aka words that never appeared in the training set. Labeling OOV data points is sort of the general benefit of machine learning, and I'm not sure it can be attributed solely to Deep Learning. The main benefit I've gleaned from Deep Learning is that it automates the feature engineering phase of the machine learning pipeline.

There are lots of good resources for RNNs, LSTMs, Word Embeddings and t-SNE out there from Stanford, NYU, Theano, TensorFlow, and the like. Here's a blog post that gives some background if you're interested: http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...

Hello. This is Adam. I trained the model and made the visualization. Thanks for your comments.

This model is not a POS tagger. The model was trained to predict the next word in the email given the preceding words. So in that sense, it's similar to the word2vec models discussed in the link you shared. However for this work I used a recurrent neural network to learn a language model of the emails in our database.
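
As a rough sketch of what "predict the next word given the preceding words" means as a training setup, the snippet below turns one tokenized sentence into (context, target) pairs. The function name, the example sentence, and the fixed `context_size` window are all assumptions for illustration; an RNN carries the whole preceding sequence in its hidden state rather than a fixed window.

```python
def next_word_pairs(tokens, context_size=3):
    """Build (context, target) pairs: up to `context_size` preceding words and the next word."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

# Hypothetical example sentence, not taken from any real dataset.
email = "can we meet on tuesday".split()
pairs = next_word_pairs(email)
for context, target in pairs:
    print(context, "->", target)
```

No labels beyond the text itself are needed: every word in the corpus serves as the target for the words that precede it.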

After training, I extracted the learned word vectors from the model (they are the weights that connect the input layer which uses a one-hot-encoding of vocab words to the embedding layer). I then used the t-SNE algorithm to reduce the dimensionality of the learned word vectors and then plotted them in 2 dimensions. The colors representing the parts of speech were added after the fact to show that the model had learned to distinguish between nouns, verbs, etc.

Thanks Adam! It's nice work and it seems like there's a pretty epic dataset to analyze at x.ai. My main confusion was what the visualized vectors represented, but I guess you've answered that by saying they're the first layer in your model (if I'm interpreting correctly). What I don't quite understand is how you got the word vector from the inputs. It sounds like you represent each word as a one hot encoding (similar to indexes), and then you pass this one hot encoding through the first layer giving you the word vector for each input?
That's right. The weights that connect the Nth neuron in the one-hot input layer to the embedding layer can be thought of as a vector encoding of the Nth word in the vocabulary.
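
A small sketch of why this works: multiplying a one-hot vector by the input-to-embedding weight matrix just selects one row of that matrix. The vocabulary and the weight values below are hypothetical; in the real model the weights are learned during training.

```python
vocab = ["meeting", "tuesday", "lunch", "call"]

# Hypothetical 4x3 weights (vocab size x embedding dimension), made up for illustration.
W = [
    [0.2, -0.1, 0.7],
    [0.5, 0.3, -0.4],
    [-0.6, 0.8, 0.1],
    [0.0, 0.9, 0.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(one_hot_vec, weights):
    """Multiply a one-hot row vector by the weight matrix."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[n] * weights[n][d] for n in range(len(weights)))
        for d in range(dim)
    ]

n = vocab.index("lunch")
vector = embed(one_hot(n, len(vocab)), W)
print(vector == W[n])
```

Because the product always equals row N, libraries typically implement an embedding layer as a table lookup instead of an actual matrix multiplication.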
How does the recurrent neural network technique compare to the CBOW technique in word2vec? CBOW would've been the first thing I tried.
I agree that's an interesting comparison to make but I'm not sure of the answer. The original purpose of this work was not to generate word vectors but rather to evaluate whether we have enough data to start using deep learning algorithms. That an RNN trained on our data was able to learn word vectors with a significant amount of structure seems like a positive sign. But I don't know how the quality of these word vectors would compare to vectors generated by more standard word2vec algorithms.
There are tons of ways to evaluate word vector quality! Word analogy tasks, word similarity tasks, contextual prediction tasks, etc.
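
As one hedged illustration, a word similarity task scores each word pair by cosine similarity and then checks how well the model's ranking agrees with human ratings, usually via Spearman rank correlation. The vectors and the ratings below are made-up toy values, not from any benchmark, and the `spearman` helper assumes there are no ties.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def ranks(values):
    """Rank values from 1..n (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical toy vectors and made-up human similarity ratings (0-10 scale).
vectors = {
    "meeting": [0.9, 0.1, 0.0],
    "call":    [0.8, 0.3, 0.1],
    "lunch":   [0.3, 0.9, 0.1],
    "tuesday": [0.0, 0.1, 0.9],
}
human_scores = [
    ("meeting", "call", 8.5),
    ("meeting", "lunch", 5.0),
    ("meeting", "tuesday", 2.0),
    ("lunch", "tuesday", 1.5),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in human_scores]
rho = spearman(model_scores, [score for _, _, score in human_scores])
print(round(rho, 3))
```

A correlation near 1 means the vectors order the pairs the way people do; running the same script over two sets of vectors gives a direct comparison on the same pairs.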

This link contains a bunch of relevant evaluation datasets and benchmarks obtained using word2vec, GloVe, etc. You can evaluate your RNN-learned vectors and compare them to traditionally trained word2vec vectors. Link here: http://www.bigdatalab.ac.cn/benchmark/bm/Domain?domain=Word%...

For more background on evaluating word vectors check out these pretty great lecture notes from Socher's NLP class: http://cs224d.stanford.edu/lecture_notes/LectureNotes2.pdf

Also, here are the original papers from a few years ago that introduced many of these datasets and evaluation standards:

https://papers.nips.cc/paper/5021-distributed-representation...

http://www.cs.cmu.edu/~mfaruqui/papers/acl14-vecdemo.pdf

As a starting point, you could read this nicely written review to get more info about RNNs: http://arxiv.org/abs/1506.00019
Implementation-wise, did you train it with one of the widespread Python libs or opt for one of the scala(nlp) frameworks? If the latter, I'd be interested in which one worked for you for LSTMs (factorie is more probabilistic, mllib afaik is no good for compute graphs, d4j / sparkling water).
This work was done in Python using Theano and Keras.
This article is the ML equivalent of Hello World.
So where does my MNIST classifier fall? The ML equivalent of opening a text editor?
It's the FizzBuzz.
Hello World is not trivial if you are writing it in java.
Not a great analogy. It's still one line of code; just a very stupid one line of code. [System.out.println("Hello World");]
It's not even a stupid line of code - it's just a line of code which happens to namespace things which most other languages don't.

System is a class that exposes various pieces of the runtime system. One static field on this class is `out` (of type PrintStream), which represents stdout. The `println` method of PrintStream actually does the work.

I get why this seems overly verbose in comparison to Python and similar languages, but I don't think it's stupid. It's just emphasizing explicitness and uniformity over brevity.

This seems like a poor attempt at getting free press.
If that's the case then it worked and I applaud their effort. I would have liked to hear about how they're using neural networks in their system. I wrote a good-enough system that can handle similar things with scheduling meetings, but it's an expert system, not a neural net, so I'm curious how they use them, if at all.
Marcos here, ds at x.ai ... happy to answer that question in person. If you are around NYC, feel free to pass by our offices for a coffee/chat :-)
Ha, I'd love to but unlikely I'll be around NYC any time soon.