by herrherr
4396 days ago
I've been trying for weeks to get a system running that can handle larger-than-RAM datasets and return query results in an acceptable time. It runs OK now, but it is far from optimal (the DB is ~100 GB and contains a few hundred million entries). Does anyone here have experience with any implementations (such as likelike, lshkit, etc.) and can recommend one that handles larger sets? Every implementation I have found was either unmaintained, old, not running, or unsuitable for production use. I will definitely take a look at the paper, but unfortunately it's always a very long way from a paper to an actual implementation (no code has been published, as far as I could see).
Detecting Near-Duplicates for Web Crawling (http://www.wwwconference.org/www2007/papers/paper215.pdf)
SEOMoz has in-memory and db-backed implementations of simhash in Python (https://github.com/seomoz?query=simhash)
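
For anyone who wants to see the basic idea before picking a library, here is a minimal in-memory sketch of a simhash-style fingerprint in plain Python. It is not the SEOMoz code and not the paper's exact method: the names `token_hash`, `simhash`, and `hamming` are made up for illustration, every token gets equal weight, and MD5 truncated to 64 bits is just an assumed stand-in for whatever token hash a real system would use. It also does nothing about the larger-than-RAM indexing problem raised above.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption for this sketch


def token_hash(token):
    """Hash a token to a 64-bit integer (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens):
    """Combine per-token hashes into one 64-bit fingerprint.

    Each token votes +1 on every bit position where its hash has a 1
    and -1 where it has a 0; the fingerprint keeps the bits whose
    total vote is positive.
    """
    votes = [0] * BITS
    for token in tokens:
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox jumped over the lazy dog".split()
distance = hamming(simhash(doc_a), simhash(doc_b))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicate candidates. The distance threshold and how to index fingerprints for fast lookup are separate design choices that this sketch leaves open.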