Hacker News
by
jimsimmons
937 days ago
What is the best way to deduplicate a corpus of documents?
2 comments
marginalia_nu
937 days ago
If you mean dealing with documents that are very similar but not byte-for-byte identical, a locality-sensitive hash would do the job.
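To make that concrete, here is a minimal sketch of one locality-sensitive hash, SimHash. The comment does not say which scheme to use, so SimHash, the word 3-gram shingles, the 64-bit width, and the names `simhash` and `hamming` are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint; similar texts tend to differ in few bits."""
    counts = [0] * bits
    tokens = re.findall(r"\w+", text.lower())
    # Word 3-gram shingles (an assumed choice); short texts yield one shingle.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps the bits where the positive votes won.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when `hamming(simhash(a), simhash(b))` falls below a threshold tuned on your own corpus. For a large corpus you would typically bucket fingerprints by bit ranges rather than compare every pair.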
OnlyMortal
937 days ago
Use a Rabin fingerprinting algorithm and hash the data it generates.
Reference-count the hashes.
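One way to read this is content-defined chunking: a rolling fingerprint picks the chunk boundaries, each chunk is hashed, and a store keeps one copy per hash plus a reference count. The sketch below follows that reading under stated assumptions. It uses a Rabin-Karp style polynomial rolling hash as a stand-in for a true Rabin fingerprint over GF(2), and `WINDOW`, `MASK`, `MIN_CHUNK`, `chunks`, and `ChunkStore` are hypothetical names and parameters, not anything the comment specifies.

```python
import hashlib
from collections import Counter

WINDOW = 16           # bytes in the rolling window (assumed)
BASE = 257            # polynomial base (assumed)
MOD = (1 << 61) - 1   # large prime modulus
MASK = (1 << 6) - 1   # cut when the low 6 bits are zero (assumed)
MIN_CHUNK = 16        # never cut before the window is full


def chunks(data: bytes):
    """Yield content-defined chunks; boundaries depend on the last WINDOW bytes."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    start = 0
    for i, b in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # take in the new byte
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == 0:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk


class ChunkStore:
    """Stores each distinct chunk once and reference-counts its hash."""

    def __init__(self):
        self.blobs = {}        # digest -> chunk bytes
        self.refs = Counter()  # digest -> number of references
        self.docs = {}         # doc id -> ordered list of digests

    def add(self, doc_id, data: bytes):
        if doc_id in self.docs:
            self.remove(doc_id)  # replace an existing document cleanly
        digests = []
        for chunk in chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            self.blobs.setdefault(digest, chunk)
            self.refs[digest] += 1
            digests.append(digest)
        self.docs[doc_id] = digests

    def remove(self, doc_id):
        for digest in self.docs.pop(doc_id):
            self.refs[digest] -= 1
            if self.refs[digest] == 0:  # last reference gone, free the chunk
                del self.refs[digest]
                del self.blobs[digest]

    def get(self, doc_id) -> bytes:
        return b"".join(self.blobs[d] for d in self.docs[doc_id])
```

Adding the same document under two ids stores its chunks once, and a chunk is freed only when its count drops to zero. Because boundaries follow content rather than fixed offsets, an insertion near the start of a file should only disturb the chunks around the edit.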