by kawin
2791 days ago
Interesting question! I think you have the right idea: the GloVe or SGNS vector for a word is some composition of its word sense representations. The number of senses for a word isn't necessarily finite either; one could argue that each possible context a word could appear in denotes a unique word sense. I suspect that ELMo (and others) work by mapping a word vector to a word sense vector conditioned on the context, where that context is much larger than the one used by shallow embeddings like GloVe. If GloVe and SGNS are implicitly factorizing word-context matrices containing a co-occurrence statistic like PMI, then ELMo might be implicitly factorizing a (word sense)-(context sense) matrix containing the same co-occurrence statistic. If this is true, then we'd expect the csPMI property to hold at the word sense level as well. It'd be much harder to prove though, given how much more complex ELMo is than GloVe/SGNS.
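To make the "implicitly factorizing a word-context matrix" part concrete, here is a minimal sketch, not what GloVe or SGNS literally do during training. It builds a PMI matrix from a made-up 3x3 co-occurrence table and factorizes it with a truncated SVD, so that word vector · context vector approximates PMI(word, context). The counts and the helper names `pmi_matrix` and `factorize` are assumptions for illustration only.

```python
import numpy as np

def pmi_matrix(counts):
    """PMI(w, c) = log p(w, c) / (p(w) p(c)); assumes every count is > 0."""
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()
    p_w = p_wc.sum(axis=1, keepdims=True)   # marginal over contexts
    p_c = p_wc.sum(axis=0, keepdims=True)   # marginal over words
    return np.log(p_wc / (p_w * p_c))

def factorize(matrix, dim):
    """Truncated SVD split symmetrically, so that W @ C.T approximates matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    root = np.sqrt(s[:dim])
    word_vecs = u[:, :dim] * root
    context_vecs = vt[:dim, :].T * root
    return word_vecs, context_vecs

# Toy counts (rows = words, columns = context words); purely illustrative.
counts = np.array([[10.0, 2.0, 1.0],
                   [3.0, 8.0, 2.0],
                   [1.0, 2.0, 9.0]])

pmi = pmi_matrix(counts)
W, C = factorize(pmi, dim=3)                 # full rank: exact reconstruction
max_err = float(np.abs(W @ C.T - pmi).max())

W2, C2 = factorize(pmi, dim=2)               # lower rank: lossy approximation
max_err_rank2 = float(np.abs(W2 @ C2.T - pmi).max())
```

The hypothesized ELMo analogue would index rows and columns by (word sense) and (context sense) instead of by word and context word, which is what makes it so much harder to write down explicitly.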
Naturally, I'm wondering whether it might be possible to approximate a (pretrained) ELMo (or similar) model with two simpler transformations: first a transformation to the space of word-sense compositions (e.g., via GloVe/SGNS), and then a transformation to a space that somehow encodes probabilities over word senses given the context. Hmmm...
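A toy sketch of what that two-step idea could look like, under strong assumptions: a finite set of senses (the comment above argues the real number may be unbounded), hand-picked 2-d sense vectors, and a softmax over dot products as the "probabilities over word senses given the context". None of this comes from ELMo itself; `sense_posterior` and all of the numbers are hypothetical.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

def sense_posterior(sense_vecs, context_vec, prior):
    """Score each sense against the context, then normalize with the prior."""
    logits = sense_vecs @ context_vec + np.log(prior)
    return softmax(logits)

# Step 1: the static vector as a prior-weighted composition of sense vectors.
# Hypothetical senses of "bank": row 0 = finance, row 1 = river.
senses = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
prior = np.array([0.7, 0.3])
static_vec = prior @ senses

# Step 2: a context vector re-weights the senses.
river_context = np.array([0.0, 3.0])     # made-up context pointing at "river"
posterior = sense_posterior(senses, river_context, prior)
contextual_vec = posterior @ senses      # sense-weighted vector for this context
```

Here the static vector leans toward the finance sense because of the prior, while the river-like context shifts most of the weight to the river sense.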