Hacker News
by jimsimmons 937 days ago
What is the best way to deduplicate a corpus of documents?
2 comments

If you mean documents that are very similar but not byte-for-byte identical, a locality-sensitive hash would do the job.
One option is to run a Rabin fingerprinting algorithm over each document and then hash the data it generates.
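
A rough sketch of the fingerprinting idea, for illustration only. It assumes a Rabin-Karp style polynomial rolling hash rather than a true Rabin fingerprint over GF(2), and compares documents by the Jaccard overlap of their window fingerprints. The names `BASE`, `MOD`, `window_fingerprints`, and `similarity`, as well as the window size of 8 bytes, are my own choices and not from any particular library.

```python
# Assumption: polynomial rolling hash (Rabin-Karp style), not a true Rabin fingerprint.
BASE = 257
MOD = (1 << 61) - 1  # a large Mersenne prime keeps collisions unlikely


def window_fingerprints(data: bytes, window: int = 8) -> set[int]:
    """Return the rolling-hash fingerprint of every `window`-byte substring."""
    if not data:
        return set()
    window = min(window, len(data))
    high = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window

    h = 0
    for b in data[:window]:
        h = (h * BASE + b) % MOD
    fingerprints = {h}

    for i in range(window, len(data)):
        h = (h - data[i - window] * high) % MOD  # drop the oldest byte
        h = (h * BASE + data[i]) % MOD           # shift in the newest byte
        fingerprints.add(h)
    return fingerprints


def similarity(a: bytes, b: bytes, window: int = 8) -> float:
    """Jaccard similarity of the two documents' fingerprint sets."""
    fa, fb = window_fingerprints(a, window), window_fingerprints(b, window)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


if __name__ == "__main__":
    doc_a = b"The quick brown fox jumps over the lazy dog near the river bank."
    doc_b = b"The quick brown fox jumps over the lazy cat near the river bank."
    doc_c = b"Completely different text about database indexing strategies."
    print(similarity(doc_a, doc_b))  # high: only one word differs
    print(similarity(doc_a, doc_c))  # low: almost nothing in common
```

Documents whose similarity exceeds some threshold you pick (the right value depends on the corpus) can be treated as near-duplicates. For a large corpus you would keep only a small sample of fingerprints per document instead of the full set.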

Reference-count the hashes: store each distinct hash once and track how many documents map to it.
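
A minimal sketch of counting hashes, assuming exact duplicates and SHA-256 as the hash (the thread names neither). The function names `digest`, `reference_counts`, and `deduplicate` are hypothetical.

```python
import hashlib
from collections import Counter


def digest(doc: str) -> str:
    """Hash a document's contents (SHA-256 is an assumption; any stable hash works)."""
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()


def reference_counts(docs: list[str]) -> Counter:
    """Count how many documents map to each hash."""
    return Counter(digest(d) for d in docs)


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first document seen for each hash, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for d in docs:
        h = digest(d)
        if h not in seen:
            seen.add(h)
            unique.append(d)
    return unique


if __name__ == "__main__":
    corpus = ["alpha", "beta", "alpha", "gamma", "beta", "alpha"]
    counts = reference_counts(corpus)
    print(counts[digest("alpha")])  # number of copies of "alpha"
    print(deduplicate(corpus))
```

Any hash with a count above one marks a duplicate. This only catches byte-identical documents; for near-duplicates you would count similarity-preserving hashes instead.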