by herrherr 4396 days ago
I've been trying for weeks now to get a system running that can handle larger-than-RAM datasets and return query results in an acceptable time. It's running OK now, but far from optimal (the DB is ~100 GB and contains a few hundred million entries).

Does anyone here have experience with any implementations (such as likelike, lshkit, etc.) and can recommend something that can handle larger sets? All the implementations I have found were either not maintained, old, not running or not suitable for production use.

Will definitely take a look at the paper but unfortunately it's always a very long way from here to an actual implementation (there is no code published as far as I could see).

2 comments

Google's simhash paper shows how to handle 8 billion 64-bit fingerprints in memory:

Detecting Near-Duplicates for Web Crawling (http://www.wwwconference.org/www2007/papers/paper215.pdf)

SEOMoz has in-memory and db-backed implementations of simhash in Python (https://github.com/seomoz?query=simhash)
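
For anyone who wants to see the basic idea before reading the paper, here is a minimal sketch of a simhash fingerprint plus the Hamming distance used to compare two fingerprints. It is not the paper's code and not the SEOMoz implementation; the function names, the unweighted tokens, and the choice of BLAKE2b as the per-token hash are all assumptions made for illustration. At 8 bytes per fingerprint, 8 billion of them come to roughly 64 GB of raw data.

```python
import hashlib


def simhash(tokens, bits=64):
    """Illustrative simhash: one fingerprint for a sequence of string tokens.

    Assumptions (not from the paper): every token has weight 1 and
    BLAKE2b is used as the per-token hash. `bits` must be a multiple of 8.
    """
    counts = [0] * bits
    for tok in tokens:
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # majority of tokens had this bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = simhash("the quick brown fox jumps over the lazy dog".split())
    b = simhash("the quick brown fox jumped over the lazy dog".split())
    print(f"{a:016x} {b:016x} distance={hamming(a, b)}")
```

Documents that share most of their tokens tend to end up a small Hamming distance apart. The hard part at scale, which the paper addresses, is finding all fingerprints within a few bits of a query without comparing against every entry.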

Simhash is indeed wicked fast.

Unfortunately, it's also encumbered by a patent: http://www.google.com/patents/US7158961

I've been playing with an implementation on top of Lightning MDB (LMDB)[1]. Your profile doesn't have an email, but feel free to email me if you're interested.

[1] http://symas.com/mdb/

Actually I'm also using lmdb (together with Python/numpy) :) Added an email address to my profile, would be happy to exchange some experiences.

100 GB isn't that big a deal. If you have at least 16 GB of RAM it should be a breeze. There are much larger data sets in OpenLDAP in production around the world.

But I wouldn't choose Python for large-scale data processing work. The Python CPU/memory overhead is like 100:1 compared to C. (This is why I worked on rtorrent and ditched the original BitTorrent client ASAP, and why I hate BitBake....)

First of all, thanks for open sourcing lmdb :)

The biggest problem currently is actually degrading performance, although I'm almost 100% sure that this isn't caused by lmdb itself, but rather by the bindings I've tried.

In the end, doing it directly in C is probably the only thing that will actually work.