by espeed 4397 days ago
Google's simhash paper shows how to handle 8 billion 64-bit fingerprints in memory:

Detecting Near-Duplicates for Web Crawling (http://www.wwwconference.org/www2007/papers/paper215.pdf)

SEOMoz has in-memory and db-backed implementations of simhash in Python (https://github.com/seomoz?query=simhash)
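
For anyone who hasn't seen the technique, here is a minimal sketch of a 64-bit simhash fingerprint and the Hamming-distance comparison used to spot near-duplicates. It is not the SEOMoz implementation or Google's code. The names simhash and hamming, the whitespace tokenization, the unweighted tokens, and the use of MD5 as the per-token hash are all assumptions made for illustration.

```python
import hashlib


def simhash(tokens):
    """Return a 64-bit simhash fingerprint for an iterable of string tokens.

    Illustrative sketch: every token has weight 1, and the per-token hash
    is the first 8 bytes of MD5 (an arbitrary choice, not from the paper).
    """
    counts = [0] * 64
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 on bit positions where its hash has a 1,
        # and -1 where it has a 0.
        for i in range(64):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the votes sum to a positive number.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
distance = hamming(simhash(doc_a), simhash(doc_b))
```

Similar token sets tend to produce fingerprints that differ in only a few bits, so near-duplicate detection reduces to finding fingerprints within a small Hamming distance of each other. The paper works with a threshold of 3 bits for 64-bit fingerprints; the hard part it solves is doing that lookup quickly across billions of fingerprints, which this sketch does not attempt.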

1 comment

Simhash is indeed wicked fast.

Unfortunately, it's also encumbered by a patent: http://www.google.com/patents/US7158961