

by kawin 2791 days ago

Hi, first author here! Feel free to ask any questions.

TL;DR: We prove that linear word analogies hold over a set of ordered pairs (e.g., {(Paris, France), (Ottawa, Canada), ...}) in an SGNS or GloVe embedding space with no reconstruction error when PMI(x,y) + log p(x,y) is the same for every word pair (x,y). We call this term the csPMI (co-occurrence shifted PMI). This has a number of interesting implications:

1. It implies that Pennington et al. (authors of GloVe) had the right intuition about why these analogies hold.

2. Adding two word vectors together to compose them makes sense, because you're implicitly downweighting the more frequent word -- like TF-IDF or SIF would do explicitly.

3. Using Euclidean distance to measure word dissimilarity makes sense because the Euclidean distance is a linear function of the negative csPMI.
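
For readers who want to see the quantity concretely, here is a minimal sketch of computing csPMI(x,y) = PMI(x,y) + log p(x,y) from raw co-occurrence counts. The function name `cspmi_table`, the toy counts, and the choice to take marginals from the same small table are all assumptions made for illustration; none of it comes from the paper's code.

```python
import math
from collections import Counter

def cspmi_table(pair_counts):
    """Return {(x, y): (pmi, cspmi)} from raw co-occurrence counts.

    pair_counts maps (word, context) -> count. Marginal probabilities are
    estimated from the same table, a simplifying assumption of this sketch.
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    ctx_totals = Counter()
    for (x, y), c in pair_counts.items():
        word_totals[x] += c
        ctx_totals[y] += c

    table = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / total
        p_x = word_totals[x] / total
        p_y = ctx_totals[y] / total
        pmi = math.log(p_xy / (p_x * p_y))
        # csPMI = PMI(x, y) + log p(x, y)
        table[(x, y)] = (pmi, pmi + math.log(p_xy))
    return table

# Toy counts, invented for illustration only.
toy_counts = {
    ("paris", "france"): 8,
    ("ottawa", "canada"): 8,
    ("paris", "canada"): 2,
    ("ottawa", "france"): 2,
}

for pair, (pmi, cspmi) in sorted(cspmi_table(toy_counts).items()):
    print(pair, round(pmi, 3), round(cspmi, 3))
```

In this toy table the pairs (paris, france) and (ottawa, canada) get the same csPMI, which is the condition the TL;DR describes for the set of ordered pairs.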


cs702 2791 days ago

Based on a first glance, this looks like fabulous work. I've added it to my reading list. Thank you for sharing it!

The first question that comes to mind is whether this property and its implications might hold for deep "contextualized" word embeddings such as ELMo[a], which, as I'm sure you're aware, have proven superior to "shallow" word embeddings like Word2Vec/SGNS and GloVe in a growing range of NLP tasks.

A deep contextualized word embedding model maps words like "leaves" very differently depending on context. For example, the deep contextualized vector for the word "leaves" in the sentence "In the Fall, children love to play in the leaves" will be closer to the vector for "foliage" than to the vector for "leaves" in the sentence "Children don't like it when their father leaves for work," which will be closer to the vector for "departs."

I strongly suspect the csPMI property and its implications would hold for the pair (vector("leaves"), vector("foliage")) in the first case and for the pair (vector("leaves"), vector("departs")) in the second case.

What are your (speculative) thoughts on this?

[a] https://allennlp.org/elmo

kawin 2791 days ago

Interesting question! I think you have the right idea: the GloVe or SGNS vector for a word is some composition of the word sense representations. The number of senses for a word isn't necessarily finite either -- one could argue that each possible context a word could appear in denotes a unique word sense.

I suspect that ELMo (and others) work by mapping a word vector to a word sense vector conditioned on the context, which is much larger than what is used in shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing word-context matrices containing a co-occurrence statistic like PMI, then ELMo might be implicitly factorizing a (word sense)-(context sense) matrix containing the same co-occurrence statistic. If this is true, then we'd expect the csPMI property to hold at the word sense-level as well. It'd be much harder to prove though, due to the relative complexity of ELMo compared to GloVe/SGNS.

cs702 2791 days ago

Thank you. Yes, exactly, that's my sense too (pun intended) :-)

Naturally, I'm wondering whether it might be possible somehow to approximate a (pretrained) ELMo (or similar) model with two simpler transformations: first a transformation to the space of word-sense compositions (e.g., via GloVe/SGNS), and then a transformation to a space that somehow encodes probabilities over word senses given the context. Hmmm...
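
One very loose way to picture that two-step idea is sketched below. To be clear, this is a toy and not how ELMo works: the 2-d sense vectors, the sense prior, the context vector, and the softmax over dot products are all invented assumptions. It only shows a static vector as a composition of sense vectors, and a context-dependent vector as a re-weighting of the same senses.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mix(weights, vectors):
    """Weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

# Hypothetical 2-d sense vectors for "leaves"; all numbers are invented.
senses = {"foliage": [1.0, 0.0], "departs": [0.0, 1.0]}
sense_vectors = list(senses.values())
prior = [0.5, 0.5]  # assumed context-free sense frequencies

# Step 1: a static (GloVe/SGNS-like) vector as a composition of senses.
static_vec = mix(prior, sense_vectors)

# Step 2: probabilities over senses given a context, then a re-weighted vector.
def contextual_vec(context):
    probs = softmax([dot(context, s) for s in sense_vectors])
    return probs, mix(probs, sense_vectors)

autumn_context = [3.0, 0.0]  # hypothetical context leaning toward "foliage"
probs, vec = contextual_vec(autumn_context)
print(static_vec, probs, vec)
```

With the autumn-style context, most of the probability mass lands on the "foliage" sense, so the contextual vector moves away from the even mixture toward that sense.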

thesehands 2791 days ago

There may be merit in tagging each word with its part of speech before fitting the model, in a similar way to sense2vec. Using your example above, you would then have two vectors: one for leaves|VERB and one for leaves|NOUN.
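
As a rough sketch of that preprocessing step, the snippet below joins each token with its tag so that the embedding model sees two distinct vocabulary entries. The helper name `tag_tokens` and the hand-assigned tags are assumptions for illustration; in practice a POS tagger would supply the tags, and this is not sense2vec's actual code.

```python
def tag_tokens(tagged_sentence):
    """Join each (token, pos) pair into a single 'token|POS' vocabulary key."""
    return [f"{token.lower()}|{pos}" for token, pos in tagged_sentence]

# Hand-tagged examples (assumed tags); a real pipeline would run a POS tagger.
noun_sentence = [("children", "NOUN"), ("play", "VERB"), ("in", "ADP"),
                 ("the", "DET"), ("leaves", "NOUN")]
verb_sentence = [("their", "PRON"), ("father", "NOUN"), ("leaves", "VERB"),
                 ("for", "ADP"), ("work", "NOUN")]

# The tagged corpus is what would be fed to the embedding trainer.
corpus = [tag_tokens(noun_sentence), tag_tokens(verb_sentence)]
vocab = sorted({tok for sent in corpus for tok in sent})
print(vocab)
```

The two uses of "leaves" now appear as separate keys, so a model trained on `corpus` would learn one vector per key.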
