| HN Mirror


	by jimsimmons 937 days ago
	What is the best way to deduplicate a corpus of documents?

2 comments

If you mean dealing with documents that are very similar but not byte-for-byte identical, a locality-sensitive hash would do the job.
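
A minimal sketch of one locality-sensitive scheme, MinHash over character shingles, might look like the following. The comment above does not name a specific scheme, so the choice of MinHash, the function names, the shingle length of 5, and the 64 hash functions are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-character substrings of the text.

    Case and runs of whitespace are normalized first (an assumption about
    what should count as "the same" document).
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def minhash_signature(text, num_hashes=64, k=5):
    """Build a MinHash signature: one minimum hash value per salted hash."""
    shingle_set = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed stands in for a different hash function.
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    a = minhash_signature("The quick brown fox jumps over the lazy dog")
    b = minhash_signature("The quick brown fox jumps over the lazy dog!")
    c = minhash_signature("Completely unrelated text about databases and indexes")
    print(estimated_similarity(a, b), estimated_similarity(a, c))
```

Documents whose estimated similarity exceeds some threshold (which you would have to tune for your corpus) can be treated as near-duplicates. For a large corpus you would normally bucket signatures in bands rather than compare every pair.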

Run a Rabin fingerprinting algorithm over the data and hash what it generates.

Reference-count the hashes.
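
One way to read the suggestion above is content-defined chunking: a rolling fingerprint picks chunk boundaries, each chunk is hashed, and identical chunks are stored once with a reference count. The sketch below follows that reading, which is an assumption. It also swaps in a simple polynomial rolling hash (Rabin-Karp style) rather than a true Rabin fingerprint, which uses polynomial arithmetic over GF(2). The names `split_chunks` and `add_document` and all constants are hypothetical.

```python
import hashlib
from collections import Counter

WINDOW = 16                       # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 61) - 1
TOP = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
MASK = (1 << 6) - 1               # boundary when low 6 bits are zero (~64-byte chunks)


def split_chunks(data):
    """Split bytes into chunks whose boundaries depend only on local content."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            rolling = (rolling - data[i - WINDOW] * TOP) % MOD
        rolling = (rolling * BASE + byte) % MOD
        # Only cut once the window is full, so the decision is content-defined.
        if i + 1 >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def add_document(data, store, refcounts):
    """Store each unique chunk once, count references, return the chunk digests."""
    recipe = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # keep only the first copy
        refcounts[digest] += 1
        recipe.append(digest)
    return recipe


if __name__ == "__main__":
    store, refcounts = {}, Counter()
    doc = bytes((i * 31 + i // 7) % 251 for i in range(2000))
    recipe = add_document(doc, store, refcounts)
    add_document(doc, store, refcounts)   # a duplicate adds no new chunks
    assert b"".join(store[d] for d in recipe) == doc
    print(len(store), "unique chunks,", sum(refcounts.values()), "references")
```

Each document is then just its list of digests, and a chunk can be dropped from the store once its reference count falls to zero.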