by zintinio5 3134 days ago
Like everything else, it depends on your use case. In practice I have personally used TF-IDF vectors and token sets, with cosine and Jaccard distances.

Some example use cases: are you searching for "semantically similar" documents, or "near duplicate" ones? You can compare documents under different metrics and different _representations_. Representations include LSA, PLSA, LDA, TF-IDF, and set representations; metrics include Jaccard distance, cosine distance, Euclidean distance, etc.
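
To make the representation/metric split concrete, here is a rough sketch in plain Python of two of the pairings above: token sets with Jaccard distance, and TF-IDF vectors with cosine distance. The whitespace tokenizer, the smoothed IDF formula, the helper names, and the toy documents are all my own assumptions for illustration, not anything standard; in real use you would probably reach for a library.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace split; real tokenizers do more.
    return text.lower().split()


def jaccard_distance(a, b):
    # Set representation: 1 - |A ∩ B| / |A ∪ B| over the unique tokens.
    sa, sb = set(tokenize(a)), set(tokenize(b))
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def tfidf_vectors(docs):
    # TF-IDF representation as sparse dicts {term: weight}.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: one common smoothed IDF variant, log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine_distance(u, v):
    # 1 - cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# Toy corpus (made up): two near duplicates and one unrelated document.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)

print(jaccard_distance(docs[0], docs[1]))  # small: near duplicates
print(jaccard_distance(docs[0], docs[2]))  # 1.0: no shared tokens
print(cosine_distance(vecs[0], vecs[1]))   # small: near duplicates
print(cosine_distance(vecs[0], vecs[2]))   # 1.0: no shared terms
```

Both of these are lexical, so they mostly answer the "near duplicate" question. For "semantically similar" you would swap in a different representation (LSA, LDA, etc.) and keep the same kind of distance function on top.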

Doc2vec is the Word2vec analog for documents.