I've been working on doc2vec stuff recently.

The statement that Le and Mikolov's "Distributed Representations of Sentences and Documents", frequently cited as the original example of "doc2vec", "could not be reproduced by Mikolov himself" is an overstatement: there was only one part that couldn't be completely reproduced.

It's true that Quoc Le's results on the dmpv version of doc2vec have been hard to reproduce. However, the very Stack Exchange link you cite above points out that they can be reproduced by not shuffling the data. The lack of shuffling was likely an oversight.

However, and this is important, the reason this example gets some attention is that doc2vec is a very strong model even in dbow form.

> here's an IBM research paper that leads and concludes with "we reimplemented doc2vec and made it work well"

No, they took the Gensim doc2vec implementation and experimented with parameters on different datasets[1].

Also, Mikolov's Word2Vec work was even more important than doc2vec, was fully reproducible, and was released with code and trained models while he was at Google.

[1] https://github.com/jhlau/doc2vec