Hacker News new | ask | show | jobs
by saintarian 1313 days ago
The easiest and likely most effective method may be to compute vector embeddings using a sentence transformer model, and find nearest neighbors among these vectors for all articles in the set. The distance between the nearest vectors will give you a degree of similarity between the articles. You'll need to figure out some thresholds on these distances to figure out what are near copies vs different articles on the same story. There are efficient methods to find approximate nearest neighbors among a large set of these vectors - available as both OSS and SaaS. Faiss [1], ScaNN [2], and Pinecone [3] are some examples.

This is one of the methods mentioned in the article. I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method Q-grams don't scale as well and are not as robust because it doesn't encapsulate an understanding of the semantics of the text.

[1] github.com/facebookresearch/faiss

[2] github.com/google-research/google-research/tree/master/scann

[3] www.pinecone.io

2 comments

If looking for exact or near-exact duplicates, a transformer seems like it's probably overkill. Maybe it's not bad if you already have one that you can use for inference in the database, but I suspect that something as simple as Fasttext would do the job. A transformer would probably be more useful if you want to catch things like replacing words with synonyms out of a thesaurus.
Wouldn't that find articles that are semantically similar, rather than structurally similar (which I interpreted GP as wanting)?

In the structural-comparison case, I imagine you might have better luck with just doing cosine similarity across the term frequency vector or some such, possibly doing random projection first to reduce dimensionality.

Or really, an LSH would do the trick.