Hacker News
by kawin 2791 days ago
Interesting question! I think you have the right idea: the GloVe or SGNS vector for a word is some composition of the word sense representations. The number of senses for a word isn't necessarily finite either -- one could argue that each possible context a word could appear in denotes a unique word sense.

I suspect that ELMo (and similar models) work by mapping a word vector to a word sense vector conditioned on the context, where that context is much larger than the one used by shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing word-context matrices containing a co-occurrence statistic like PMI, then ELMo might be implicitly factorizing a (word sense)-(context sense) matrix containing the same co-occurrence statistic. If this is true, then we'd expect the csPMI property to hold at the word-sense level as well. It'd be much harder to prove, though, given how much more complex ELMo is than GloVe/SGNS.
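
To make the "word-context matrix containing PMI" part concrete, here is a minimal sketch of how such a matrix is built from raw co-occurrence counts. The toy pairs and the function name `pmi` are made up for illustration; this only shows the matrix that shallow models are said to factorize implicitly, not the factorization itself and not anything about ELMo.

```python
import math
from collections import Counter

# Toy (word, context) pairs, invented purely for illustration.
pairs = [
    ("leaves", "tree"), ("leaves", "tree"), ("leaves", "train"),
    ("departs", "train"), ("departs", "train"), ("foliage", "tree"),
]

pair_counts = Counter(pairs)
word_counts = Counter(w for w, _ in pairs)
context_counts = Counter(c for _, c in pairs)
total = len(pairs)

def pmi(word, context):
    """PMI(w, c) = log[p(w, c) / (p(w) p(c))]; -inf if the pair never co-occurs."""
    joint = pair_counts[(word, context)] / total
    if joint == 0:
        return float("-inf")
    p_word = word_counts[word] / total
    p_context = context_counts[context] / total
    return math.log(joint / (p_word * p_context))

# The word-context PMI matrix, stored as a nested dict.
pmi_matrix = {w: {c: pmi(w, c) for c in context_counts} for w in word_counts}
```

Note that the ambiguous word "leaves" gets a single row that mixes both of its contexts, which is the composition-of-senses effect discussed above.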

1 comments

Thank you. Yes, exactly, that's my sense too (pun intended) :-)

Naturally, I'm wondering whether it might be possible somehow to approximate a (pretrained) ELMo (or similar) model with two simpler transformations: first a transformation to the space of word-sense compositions (e.g., via GloVe/SGNS), and then a transformation to a space that somehow encodes probabilities over word senses given the context. Hmmm...
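
A rough sketch of the shape I have in mind, with everything in it assumed: the sense vectors are invented numbers, the static vector is modeled as a plain average of sense vectors, and the context is reduced to a single vector scored with a softmax. It is not an approximation of ELMo, just the two transformations written out.

```python
import math

# Hypothetical sense vectors for "leaves"; the numbers are invented.
sense_vectors = {
    "leaves|foliage": [1.0, 0.0],
    "leaves|departs": [0.0, 1.0],
}

def static_vector(senses):
    """Step 1 (assumption): the static vector is the average of the sense vectors."""
    dim = len(next(iter(senses.values())))
    return [sum(v[i] for v in senses.values()) / len(senses) for i in range(dim)]

def sense_probabilities(senses, context_vector):
    """Step 2 (assumption): softmax over sense-context dot products as p(sense | context)."""
    scores = {
        s: sum(a * b for a, b in zip(v, context_vector)) for s, v in senses.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {s: math.exp(x - top) for s, x in scores.items()}
    norm = sum(exps.values())
    return {s: e / norm for s, e in exps.items()}

def contextual_vector(senses, context_vector):
    """Probability-weighted mix of the sense vectors for one context."""
    probs = sense_probabilities(senses, context_vector)
    dim = len(context_vector)
    return [sum(probs[s] * senses[s][i] for s in senses) for i in range(dim)]

tree_context = [3.0, 0.0]  # made-up context vector pointing toward the foliage sense
probs = sense_probabilities(sense_vectors, tree_context)
```

With a context pointing toward the foliage sense, most of the probability mass lands on that sense, and the contextual vector moves away from the averaged static vector toward it.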

There may be merit in tagging each word with its part of speech before fitting the model, similar to sense2vec. Using your example above, you would then have two vectors: one for leaves|VERB and one for leaves|NOUN.
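
A minimal sketch of the tagging idea, assuming the tags are already available. The sentences are hand-tagged toy data rather than output from a real tagger, and `to_keys` and `cooccurrences` are hypothetical helper names. Each token becomes a `word|TAG` key before counting, so the two uses of "leaves" end up as separate vocabulary entries.

```python
from collections import defaultdict

# Hand-tagged toy sentences; the tags are assumed, not produced by a real tagger.
tagged_sentences = [
    [("she", "PRON"), ("leaves", "VERB"), ("early", "ADV")],
    [("the", "DET"), ("leaves", "NOUN"), ("fell", "VERB")],
]

def to_keys(sentence):
    """Join each word with its part-of-speech tag, e.g. 'leaves|VERB'."""
    return [f"{word}|{tag}" for word, tag in sentence]

def cooccurrences(sentences, window=2):
    """Count co-occurrences between tagged keys within a symmetric window."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        keys = to_keys(sentence)
        for i, key in enumerate(keys):
            start = max(0, i - window)
            stop = min(len(keys), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[key][keys[j]] += 1
    return counts

counts = cooccurrences(tagged_sentences)
```

Any embedding model fitted on these keys would then learn one vector per `word|TAG` pair, at the cost of depending on the tagger's accuracy.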