by gojomo 236 days ago
If by 'doc2vec' you mean the word2vec-like 'Paragraph Vectors' technique: even though that's a far simpler approach than transformer embeddings, it usually works pretty well for coarse document similarity. Even the famous word2vec-style vector-arithmetic operations kinda worked on the doc vectors, as illustrated by some examples in the follow-up 'Paragraph Vector' paper from 2015: https://arxiv.org/abs/1507.07998

So if for you the resulting doc-to-doc similarities seemed nonsensical, there was likely some process error in model training or application.
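
One rough way to check for that kind of process error: re-infer a vector for each training document and see whether its nearest trained vector is the document itself. Below is a minimal sketch of that check in plain Python. The names `cosine` and `self_rank` and the toy vectors are made up for illustration; with gensim you'd presumably fill `reinferred` from `infer_vector()` on your trained model, but that wiring is an assumption about your setup.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def self_rank(trained, reinferred):
    """For each doc id, return the rank of its own trained vector when all
    trained vectors are sorted by similarity to the re-inferred vector.
    Rank 0 means the doc is its own nearest neighbor."""
    ranks = {}
    for doc_id, vec in reinferred.items():
        ordered = sorted(
            trained,
            key=lambda other: cosine(vec, trained[other]),
            reverse=True,
        )
        ranks[doc_id] = ordered.index(doc_id)
    return ranks

# Toy vectors, purely illustrative (not real model output).
trained = {
    "a": [1.0, 0.1, 0.0],
    "b": [0.0, 1.0, 0.2],
    "c": [0.1, 0.0, 1.0],
}
reinferred = {
    "a": [0.9, 0.2, 0.1],
    "b": [0.1, 0.8, 0.3],
    "c": [0.0, 0.1, 0.9],
}

ranks = self_rank(trained, reinferred)
# Fraction of docs that are their own nearest neighbor.
healthy_fraction = sum(r == 0 for r in ranks.values()) / len(ranks)
print(ranks, healthy_fraction)
```

If most docs don't land at or near rank 0, I'd suspect preprocessing mismatches between training and inference, too few training epochs, or too little data before blaming the algorithm.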