Hacker News new | ask | show | jobs
by dang 1317 days ago
I would like to know if any of these techniques could be used for identifying articles that are either copies of each other, or near-copies, or different articles on the same story.
4 comments

The easiest and likely most effective method may be to compute vector embeddings using a sentence transformer model, and find nearest neighbors among these vectors for all articles in the set. The distance between the nearest vectors will give you a degree of similarity between the articles. You'll need to figure out some thresholds on these distances to figure out what are near copies vs different articles on the same story. There are efficient methods to find approximate nearest neighbors among a large set of these vectors - available as both OSS and SaaS. Faiss [1], ScaNN [2], and Pinecone [3] are some examples.

This is one of the methods mentioned in the article. I don't have implementation experience with the other string distance measures in the article (under "normalized string" in the table), except for Q-grams. Compared to the above method Q-grams don't scale as well and are not as robust because it doesn't encapsulate an understanding of the semantics of the text.

[1] github.com/facebookresearch/faiss

[2] github.com/google-research/google-research/tree/master/scann

[3] www.pinecone.io

If looking for exact or near-exact duplicates, a transformer seems like it's probably overkill. Maybe it's not bad if you already have one that you can use for inference in the database, but I suspect that something as simple as Fasttext would do the job. A transformer would probably be more useful if you want to catch things like replacing words with synonyms out of a thesaurus.
Wouldn't that find articles that are semantically similar, rather than structurally similar (which I interpreted GP as wanting)?

In the structural-comparison case, I imagine you might have better luck with just doing cosine similarity across the term frequency vector or some such, possibly doing random projection first to reduce dimensionality.

Or really, an LSH would do the trick.

Yes, see these examples, which frame this as plagiarism detection but effectively do the same thing:

- https://dzone.com/articles/build-a-plagiarism-checker-using-...

- https://www.pinecone.io/learn/plagiarism-detection/

You should be able to use embeddings for this (sort by the cosine similarity). eg OpenAI has an off the shelf offering: https://beta.openai.com/docs/guides/embeddings/what-are-embe...

We used something similar to build a “similar articles” feature & it gave us de-duplication essentially for free.

Exact and near duplicate articles should have similar or identical word frequency distributions. Maybe that can be used as a blocking criterion somehow. Although it might not be any faster to compare word frequency distributions than to compare dense low-dimensional embeddings.