by saintarian
1313 days ago
The easiest and likely most effective method may be to compute vector embeddings using a sentence transformer model, then find each article's nearest neighbors among the vectors for all articles in the set. The distance between an article's vector and its nearest neighbor gives you a degree of similarity between the two articles (the smaller the distance, the more similar). You'll need to tune some thresholds on these distances to separate near copies from different articles on the same story.

There are efficient methods for finding approximate nearest neighbors in a large set of these vectors, available as both OSS and SaaS. Faiss [1], ScaNN [2], and Pinecone [3] are some examples. This is one of the methods mentioned in the article.

I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the embedding method above, Q-grams don't scale as well and are less robust, because they don't capture the semantics of the text.

[1] github.com/facebookresearch/faiss
[2] github.com/google-research/google-research/tree/master/scann
[3] www.pinecone.io
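
To make the nearest-neighbor-plus-threshold idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the 3-dimensional toy vectors stand in for real sentence transformer embeddings, cosine distance is just one reasonable choice of metric, and the function names and the 0.05 / 0.25 thresholds are hypothetical values you would have to tune on your own data. The brute-force search is quadratic, so for a large set you would swap it for an approximate nearest neighbor library such as the ones listed above.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity, so 0.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_neighbors(vectors):
    """For each vector, return (index of the closest other vector, distance).

    Brute force, O(n^2) comparisons: fine for a demo, too slow for a
    large article set.
    """
    result = []
    for i, v in enumerate(vectors):
        best_j, best_dist = None, float("inf")
        for j, w in enumerate(vectors):
            if i == j:
                continue
            dist = cosine_distance(v, w)
            if dist < best_dist:
                best_j, best_dist = j, dist
        result.append((best_j, best_dist))
    return result


def label_pair(distance, near_copy_max=0.05, same_story_max=0.25):
    """Map a distance to a label. Both thresholds are hypothetical."""
    if distance <= near_copy_max:
        return "near copy"
    if distance <= same_story_max:
        return "same story"
    return "different"


# Toy stand-ins for embeddings; real ones would come from the model.
embeddings = [
    [0.90, 0.10, 0.00],  # article 0
    [0.89, 0.11, 0.01],  # article 1: almost identical to article 0
    [0.60, 0.60, 0.10],  # article 2: related to articles 0 and 1
    [0.00, 0.10, 0.95],  # article 3: unrelated
]

neighbors = nearest_neighbors(embeddings)
labels = [label_pair(dist) for _, dist in neighbors]

for i, ((j, dist), label) in enumerate(zip(neighbors, labels)):
    print(f"article {i}: nearest is article {j} ({label})")
```

With these toy vectors, articles 0 and 1 come out as near copies of each other, article 2 lands in the "same story" band, and article 3 is labeled different. In practice most of the work is in picking the thresholds, which depends on the embedding model and the kind of articles you have.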