by cs702 2791 days ago
Based on a first glance, this looks like fabulous work. I've added it to my reading list. Thank you for sharing it!

The first question that comes to mind is whether this property and its implications might hold for deep "contextualized" word embeddings such as ELMo[a], which, as I'm sure you're aware, have proven superior to "shallow" word embeddings like Word2Vec/SGNS and GloVe in a growing range of NLP tasks.

A deep contextualized word embedding model maps words like "leaves" very differently depending on context. For example, the deep contextualized vector for the word "leaves" in the sentence "In the Fall, children love to play in the leaves" will be closer to the vector for "foliage" than to the vector for "leaves" in the sentence "Children don't like it when their father leaves for work," which will be closer to the vector for "departs."

I strongly suspect the csPMI property and its implications would hold for the pair (vector("leaves"), vector("foliage")) in the first case and for the pair (vector("leaves"), vector("departs")) in the second case.

What are your (speculative) thoughts on this?

[a] https://allennlp.org/elmo


Interesting question! I think you have the right idea: the GloVe or SGNS vector for a word is some composition of the word sense representations. The number of senses for a word isn't necessarily finite either -- one could argue that each possible context a word could appear in denotes a unique word sense.

I suspect that ELMo (and others) work by mapping a word vector to a word-sense vector conditioned on the context, where that context is much larger than the one used in shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing word-context matrices containing a co-occurrence statistic like PMI, then ELMo might be implicitly factorizing a (word sense)-(context sense) matrix containing the same co-occurrence statistic. If this is true, then we'd expect the csPMI property to hold at the word-sense level as well. It'd be much harder to prove, though, due to the relative complexity of ELMo compared to GloVe/SGNS.
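
To make "a word-context matrix containing a co-occurrence statistic like PMI" concrete, here is a minimal sketch that builds such a table from a toy corpus. The function name `pmi_table`, the window size, and the three sentences are assumptions for illustration only. It computes plain PMI (not csPMI) and does no factorization, so it only shows the kind of matrix that GloVe/SGNS are said to factorize implicitly.

```python
import math
from collections import Counter


def pmi_table(sentences, window=2):
    """Return {(word, context): PMI} from symmetric-window co-occurrence counts."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    # PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ), estimated from the counts.
    return {
        (word, context): math.log(
            n * total / (word_counts[word] * context_counts[context])
        )
        for (word, context), n in pair_counts.items()
    }


# Toy corpus, made up for this sketch.
corpus = [
    "children play in the leaves".split(),
    "the foliage turns red in the fall".split(),
    "father leaves for work".split(),
]
pmi = pmi_table(corpus)
print(pmi[("father", "leaves")])
```

Note that "leaves" gets a single row here, pooling both senses, which is the "composition of word senses" point made above. A (word sense)-(context sense) version would need separate rows per sense.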

Thank you. Yes, exactly, that's my sense too (pun intended) :-)

Naturally, I'm wondering whether it might be possible somehow to approximate a (pretrained) ELMo (or similar) model with two simpler transformations: first a transformation to the space of word-sense compositions (e.g., via GloVe/SGNS), and then a transformation to a space that somehow encodes probabilities over word senses given the context. Hmmm...
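
A purely speculative sketch of what such a two-step approximation might look like, under heavy assumptions: the 2-d vectors below are hand-made toys rather than outputs of GloVe, SGNS, or ELMo, and the names `SENSE_VECTORS`, `CONTEXT_VECTORS`, `sense_probabilities`, and `contextual_vector` are hypothetical. Step one is stood in for by a lookup of per-sense vectors; step two is a softmax over senses given the mean context vector.

```python
import math

# Hand-made 2-d toy vectors (assumptions, not from any trained model).
SENSE_VECTORS = {
    "leaves": {"foliage": [1.0, 0.0], "departs": [0.0, 1.0]},
}
CONTEXT_VECTORS = {
    "fall": [0.9, 0.1],
    "play": [0.8, 0.2],
    "father": [0.1, 0.9],
    "work": [0.2, 0.8],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def sense_probabilities(word, context_words):
    """Estimate p(sense | context) from similarity to the mean context vector."""
    known = [CONTEXT_VECTORS[w] for w in context_words if w in CONTEXT_VECTORS]
    # With no known context words the mean is empty and all senses tie.
    mean = [sum(column) / len(known) for column in zip(*known)]
    senses = list(SENSE_VECTORS[word])
    probs = softmax([dot(SENSE_VECTORS[word][s], mean) for s in senses])
    return dict(zip(senses, probs))


def contextual_vector(word, context_words):
    """Mix the sense vectors using the context-dependent sense probabilities."""
    probs = sense_probabilities(word, context_words)
    dim = len(next(iter(SENSE_VECTORS[word].values())))
    return [
        sum(probs[s] * SENSE_VECTORS[word][s][k] for s in probs)
        for k in range(dim)
    ]


autumn = sense_probabilities("leaves", ["fall", "play"])
commute = sense_probabilities("leaves", ["father", "work"])
print(autumn, commute)
```

With these toy numbers the "foliage" sense dominates in the first context and "departs" in the second, but that is by construction. Whether a pretrained ELMo can actually be approximated this way is exactly the open question.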

There may be merit in tagging each word with its part of speech before fitting the model, similar to what sense2vec does. Using your example above, you would then have two vectors: one for leaves|VERB and one for leaves|NOUN.
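
A minimal sketch of that preprocessing step, assuming the part-of-speech tags are already available. Here they are written by hand; in practice they would come from a tagger. The helper name `to_sense_keys` is hypothetical, and this is not sense2vec's actual code.

```python
from collections import Counter


def to_sense_keys(tagged_tokens):
    """Turn (token, POS) pairs into 'token|POS' keys for embedding training."""
    return [f"{token.lower()}|{pos}" for token, pos in tagged_tokens]


# Hand-tagged toy sentences (tags are assumptions for this sketch).
tagged_corpus = [
    [("children", "NOUN"), ("play", "VERB"), ("in", "ADP"),
     ("the", "DET"), ("leaves", "NOUN")],
    [("father", "NOUN"), ("leaves", "VERB"), ("for", "ADP"),
     ("work", "NOUN")],
]

keyed_corpus = [to_sense_keys(sentence) for sentence in tagged_corpus]
vocab = Counter(key for sentence in keyed_corpus for key in sentence)
print(sorted(vocab))
```

The keyed sentences can then be fed to any GloVe/SGNS trainer in place of the raw tokens. This only separates senses that differ in part of speech; two noun senses of the same word would still share one vector.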