Interesting. There's a big need for better vector representations of things in between words (for which Word2Vec/GloVe/FastText work well) and documents (which to me seems impossible. Yes, I know about Doc2Vec etc., but really... it works OK for paragraphs).
Facebook's InferSent[1] has worked reasonably well for me for a variety of sentence-level tasks, but I don't have anything I can point to that shows it is really substantially better than averaging word embeddings.
More options is good.
(Also, is Kurzweil part of Google Brain or separate? He doesn't really have any background in NLP, does he?)
"Also, is Kurzweil part of Google Brain or separate? He doesn't really have any background in NLP, does he?"
From Wikipedia: "Raymond "Ray" Kurzweil (/ˈkɜːrzwaɪl/ KURZ-wyl; born February 12, 1948) is an American author, computer scientist, inventor and futurist. Aside from futurism, he is involved in fields such as optical character recognition (OCR), text-to-speech synthesis, speech recognition technology, and electronic keyboard instruments.... Kurzweil was the principal inventor of... the first print-to-speech reading machine for the blind,[3] the first commercial text-to-speech synthesizer,[4]... and the first commercially marketed large-vocabulary speech recognition."
He's been in the general space of NLP for quite a while.
For the record, good old fashioned bag of words representations (tf-idf, LDA, LSA) still provide useful representations for documents. Obviously we hope to do better, but recently people act like there's no way of turning a document into a vector.
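For anyone who hasn't seen it, here is a minimal sketch of what a tf-idf document vector looks like, in plain Python. The function name `tfidf_vectors`, the toy documents, and the unsmoothed `log(N / df)` weighting are my own choices for illustration; real libraries usually add smoothing and normalization.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one tf-idf vector per document).

    Illustrative only: whitespace tokenization, unsmoothed idf,
    and no handling of empty documents.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[term] / len(toks)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes a vector with one slot per vocabulary term. A term like "the" that appears in every document gets weight zero, while a term unique to one document gets the largest weight.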
Bag-of-words representations work fine for some applications.
The reason people want better representations is for the applications where they don't. For example, bag of words doesn't capture agreement or disagreement well, whereas better representations can.
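A quick sketch of that failure mode, assuming the simplest possible bag of words (lowercased whitespace tokens and raw counts). The two sentences are made up for the example.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count lowercased whitespace tokens, ignoring word order."""
    return Counter(sentence.lower().split())


# Same words, opposite claims about who the senator agrees with.
a = bag_of_words("the senator agrees with the critics not the president")
b = bag_of_words("the senator agrees with the president not the critics")

# True: the two bags are identical, so the difference is invisible.
same = a == b
```

Any model that only sees these counts cannot tell the two sentences apart, which is the gap that order-aware sentence encoders are meant to close.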
1. This is more Technical Report worthy than paper worthy...
2. "by Ray Kurzweil's Team": although accurate, I find the fetishization of certain stars pretty insulting to the other authors. We already have a convention, and it's "Cer et al. (2018)".
“We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.”
They made a way to take any sentence, and output a small array of numbers that represent its essence. You can use their model to find the essence of your own sentences. And then use it either directly (e.g. compare the essence of two sentences to see if they're saying roughly the same thing) or use it as a starting point for the model you need (e.g. if you're building a system to convert English sentences into French, your neural network might generate the essence of the English sentence as part of its work. By using the pre-trained model, you have a better starting point for that part of the network than just random numbers, so your training time will be greatly reduced).
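To make the "compare the essence of two sentences" part concrete, here is a rough sketch using cosine similarity, which is a common way to compare embedding vectors. The 4-dimensional vectors below are made up by hand for illustration; they are not output from the actual model, whose vectors are much longer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical sentence vectors, invented for this example.
embeddings = {
    "How old are you?": [0.9, 0.1, 0.3, 0.0],
    "What is your age?": [0.8, 0.2, 0.4, 0.1],
    "The train leaves at noon.": [0.0, 0.9, 0.1, 0.7],
}

paraphrase_score = cosine_similarity(
    embeddings["How old are you?"], embeddings["What is your age?"]
)
unrelated_score = cosine_similarity(
    embeddings["How old are you?"], embeddings["The train leaves at noon."]
)
```

With a good encoder you would expect paraphrases to score close to 1 and unrelated sentences to score much lower, as these toy vectors do.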
The array of numbers represents some opaque statistical property of the sentence with respect to the others in the corpus the model was trained from. The hope is that this property will correlate with what we believe to be the sentence's meaning.
They have an algorithm that takes sentences in textual form and produces a different representation of each sentence that (they claim) is easier for certain language-oriented machine learning tasks to work with. Previous work focused on producing that different representation at the word level, but theirs works on complete sentences.
I had been under the impression that you could just feed text into neural nets, and then ... magic!
But, no. As it turns out, the very first problem you encounter when trying to implement ML on text is that you need to transform the text into some set of numbers (the "vectors"), with the number of elements in the set matching the number of nodes in your input layer.
This is a tricky thing to do. You're essentially trying to "hash" the text in a way which uniquely represents the text you're working with and also gives the neural net something it can operate on. Which is to say, you can't just use a common hashing algorithm, because the neural net won't be able to learn anything from the random output of the hashing algorithm.
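Here is a small sketch of why an ordinary hash is useless as input, using SHA-256 from the standard library. The helper names `sha256_bits` and `hamming_fraction` and the example sentences are mine.

```python
import hashlib


def sha256_bits(text):
    """Return the SHA-256 digest of text as a string of 256 '0'/'1' characters."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)


def hamming_fraction(a, b):
    """Fraction of digest bits that differ between two texts."""
    bits_a, bits_b = sha256_bits(a), sha256_bits(b)
    differing = sum(x != y for x, y in zip(bits_a, bits_b))
    return differing / len(bits_a)


# A near-paraphrase and a completely unrelated sentence.
near = hamming_fraction("the cat sat on the mat", "the cat sat on a mat")
far = hamming_fraction("the cat sat on the mat", "stock prices fell sharply today")
```

Both fractions should land near one half, because a cryptographic hash is designed so that any change to the input scrambles the output. The near-paraphrase ends up no "closer" than the unrelated sentence, so there is no similarity structure for the network to learn from.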
There are several different approaches being used for this. One of them, mentioned elsethread, is "bag-of-words", where you build a big dictionary of word-to-number associations and then do some variety of transformations on that. Another is "feature extraction", where you might try to input a value representing properties like the length of the sentence, the number of words, the vocabulary level of the words, and so on. (This would probably be a bad approach for most ML goals on long text.)
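A rough sketch of both approaches in plain Python. All the names here (`build_vocabulary`, `count_vector`, `surface_features`) and the toy corpus are made up for illustration, and the three surface features are just examples of the kind of properties you might pick.

```python
def build_vocabulary(sentences):
    """Map each distinct lowercased word to an index, in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def count_vector(sentence, vocab):
    """Bag of words: one count per vocabulary entry, so the length is fixed."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:  # words outside the dictionary are silently dropped
            vec[vocab[word]] += 1
    return vec


def surface_features(sentence):
    """Hand-picked features: characters, words, mean word length.

    Assumes the sentence contains at least one word.
    """
    words = sentence.split()
    return [len(sentence), len(words), sum(len(w) for w in words) / len(words)]


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(corpus)
vec = count_vector("the dog chased the cat", vocab)
features = surface_features("the cat sat")
```

Note that every sentence maps to a vector of the same length as the vocabulary, which is what lets it line up with a fixed-size input layer, and that "chased" vanishes because it was never in the dictionary.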
So does this work? Am I getting redirected back to that page when I click the link because they're checking my user agent? I don't have TF installed on this machine to check, but does getting the model through the TF API work?
lol, although I might have to take some blame by putting a link in my comment to begin with.
Note: Keep in mind that some folks publish on arXiv because it is far easier than going through a traditional publication process. As such, you sometimes get less polished works like this, although they might update the article to fix some of those references.
As someone who has done an ML course and built a primitive Word2Vec but doesn't really follow the field all that closely: how important is this, and how does it compare to what came before?
[1] https://github.com/facebookresearch/InferSent