by utunga
3954 days ago
Hi! Great work. I guess my question is: do you use 'averaging' of word vectors or the Chinese Restaurant process to get to subreddit vectors? You describe the Chinese Restaurant process as a "more sophisticated method" that you "can" use, but in my experiments with word2vec and reddit (https://github.com/utunga/gensimred) I quickly discovered that simple averaging just does not work. Averaging has this awful 'reversion to the mean' effect that turns all the paragraph vectors into a sort of bland gray goo where they all look the same. If you did use the Chinese Restaurant process (I love that phrase; it brings back memories of an occasion at a Dim Sum restaurant where this almost literally happened), it'd be great to see any source code you may feel like releasing ;-) ... well, it can't hurt to ask.
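To make the 'gray goo' point concrete, here is a rough toy sketch, not the code from my repo and not real reddit data. It assumes every word vector is a shared common component plus word-specific noise (an assumption for illustration; the dimensions, the 0.5 offset and the function names are all made up). Under that assumption, the more words you average, the more the noise cancels out and the more every document vector looks like the shared component.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def average(vectors):
    """Element-wise mean of a list of vectors (the 'simple averaging' method)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def make_word_vectors(rng, n_words, common):
    # ASSUMPTION: each word vector = shared component + unit-variance noise.
    return [[c + rng.gauss(0.0, 1.0) for c in common] for _ in range(n_words)]


def doc_similarity(words_per_doc, dim=50, seed=0):
    """Cosine similarity of two unrelated synthetic documents after averaging."""
    rng = random.Random(seed)
    common = [0.5] * dim  # hypothetical shared offset, chosen arbitrarily
    doc_a = average(make_word_vectors(rng, words_per_doc, common))
    doc_b = average(make_word_vectors(rng, words_per_doc, common))
    return cosine(doc_a, doc_b)


short_sim = doc_similarity(words_per_doc=5)
long_sim = doc_similarity(words_per_doc=5000)
print(f"5 words per doc:    cosine = {short_sim:.3f}")
print(f"5000 words per doc: cosine = {long_sim:.3f}")
```

The two documents here share no words at all, yet with thousands of words each their averaged vectors end up almost identical. Real word2vec vectors won't follow this noise model exactly, so treat it as an illustration of the effect rather than a measurement.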
|
The Paragraph Vector approach [1] can give interesting results for document similarity, including similarity after 'algebraic' additions and subtractions of other topics or word concepts. [2]
[1] http://arxiv.org/abs/1405.4053
[2] http://arxiv.org/abs/1507.07998
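
As a rough sketch of what that 'algebraic' similarity looks like mechanically: subtract one document vector from another and look up the nearest remaining document by cosine similarity. The 3-d vectors and topic names below are made up purely for illustration; in practice the vectors would come from a trained Paragraph Vector model (for example gensim's Doc2Vec), and `most_similar` here is a hypothetical helper, not a library call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def subtract(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]


def most_similar(query, doc_vectors, exclude=()):
    """Name of the document whose vector is closest to `query` by cosine."""
    candidates = {k: v for k, v in doc_vectors.items() if k not in exclude}
    return max(candidates, key=lambda name: cosine(query, candidates[name]))


# HYPOTHETICAL document vectors, invented for this example only.
docs = {
    "cooking": [1.0, 0.0, 0.0],
    "vegan_cooking": [1.0, 1.0, 0.0],
    "vegan_lifestyle": [0.0, 1.0, 0.0],
    "motorbikes": [0.0, 0.0, 1.0],
}

# "vegan_cooking" minus "cooking" should leave mostly the "vegan" direction.
query = subtract(docs["vegan_cooking"], docs["cooking"])
best = most_similar(query, docs, exclude=("vegan_cooking", "cooking"))
print(best)
```

Excluding the input documents from the candidates matters, since otherwise the query often just returns one of the vectors it was built from.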