HN Mirror



	by vicaya 6236 days ago
	Sorry, but a 500MB DB is a tiny dataset these days. Anything < 1GB is tiny; < 4GB is small; less than the RAM on a single node (~8GB-64GB) is medium; less than the disks on a single node (~128GB to a few TB) is large; a huge dataset requires multiple nodes and is typically above 128TB.


psadauskas 6236 days ago

I agree, plus TC has a ton of parameters that can be tweaked, and the defaults are pretty small. The one that has the most pronounced effect is the bucket count (bnum), or the "width" of the hash table. The bigger it is, the lower the chance of collisions, and every collision means you have to follow a linked list to find the exact record. He used 11M keys, so a bnum in the range of 40M would be much quicker.
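To put rough numbers on the bnum point, here is a minimal sketch. It assumes keys hash uniformly across buckets and that collisions are resolved by walking a per-bucket chain, as described above. The function names and the 1M "under-sized" bnum are hypothetical and chosen only for illustration; they are not TC defaults or figures from the benchmark under discussion.

```python
def average_chain_length(num_keys: int, bnum: int) -> float:
    """Expected records per bucket (the load factor) under uniform hashing."""
    return num_keys / bnum


def expected_probes_per_hit(num_keys: int, bnum: int) -> float:
    """Expected records examined on a successful lookup with chaining.

    Textbook result for separate chaining: 1 + (n - 1) / (2 * m).
    """
    return 1 + (num_keys - 1) / (2 * bnum)


NUM_KEYS = 11_000_000    # key count mentioned in the comment
SMALL_BNUM = 1_000_000   # hypothetical under-sized bucket count
LARGE_BNUM = 40_000_000  # the suggested ~40M

for bnum in (SMALL_BNUM, LARGE_BNUM):
    print(
        f"bnum={bnum:>10,}  "
        f"load={average_chain_length(NUM_KEYS, bnum):.3f}  "
        f"probes/hit={expected_probes_per_hit(NUM_KEYS, bnum):.3f}"
    )
```

Under these assumptions the larger bnum brings the average successful lookup down from several record comparisons to just over one. The cost is a bigger bucket array.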

I benchmarked TC b+tree on a 1TB db with ~350M keys, and it worked great. I would publish the numbers, but I'm embarrassed that they aren't very rigorous.

cdb docs say it has a limit of 4GB, which makes it pretty much worthless for anything I would use it for.


rcoder 6236 days ago

Actually, the core CDB data structure only limits keys and values to 4GB each, not in total:

http://www.unixuser.org/~euske/doc/cdbinternals/index.html

The hash algorithm used also only produces a 32-bit key, meaning you'll be limited to 2^32 total records. Again, though, unless your data is of trivial size, that gives you considerably more room to work with than a hard 4GB limit.

Edit: doh! can't use double-asterisk for exponent on HN
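For reference, the 32-bit hash mentioned above is small enough to sketch in a few lines. This is my reading of the published cdb description (start at 5381, then `h = ((h << 5) + h) ^ c` per byte, truncated to 32 bits), written in Python rather than taken from any particular implementation. The function name `cdb_hash` is my own.

```python
def cdb_hash(key: bytes) -> int:
    """32-bit hash as described in the cdb format notes (sketch)."""
    h = 5381
    for c in key:
        # (h << 5) + h is h * 33; the mask emulates 32-bit unsigned overflow.
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


# The hash always fits in 32 bits, so there are 2**32 distinct hash values.
print(hex(cdb_hash(b"example key")), 2**32)
```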


epi0Bauqu 6236 days ago

Then why does the Perl implementation (http://search.cpan.org/~msergeant/CDB_File-0.96/CDB_File.pm) warn of the error:

CDB database too large -- You attempted to create a cdb file larger than 4 gigabytes.


asb 6236 days ago

I believe rcoder is saying the 4GB database limit is a property of the implementation rather than the cdb format.


jhy 6236 days ago

Do you have a pointer to any more info on tuning TC? (Apart from the API guide)


zandorg 6236 days ago

I guess someone could tweak cdb to be >4GB, but these C libraries are always insanely convoluted.


rcoder 6236 days ago

> ...these C libraries are always insanely convoluted...

If you haven't actually looked at the code, you might want to avoid making such an overly general statement. CDB is a very simple data structure (basically a two-stage hash table) serialized to disk in a format that makes lookup fast. You can check the link I posted above for a simple explanation of the format, and this page for usage examples (from an API-compatible reimplementation):

http://www.corpit.ru/mjt/tinycdb.html

Since the core algorithm is so simple, creating a 64-bit version should be similarly easy, at least on a UNIX-like system (trying to run code designed by Dan Bernstein on a non-UNIX system would be...interesting).
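To show how little there is to the "two-stage hash table" idea, here is an in-memory sketch. It is not the on-disk format and not code from cdb or tinycdb. It assumes the layout from the format notes as I understand them: the low 8 bits of the hash pick one of 256 tables, the remaining bits pick a starting slot, and collisions are resolved by linear probing. The class name `TwoStageTable`, the helper `cdb_hash`, and the use of `None` for empty slots are my own choices.

```python
from typing import Dict, List, Optional, Tuple


def cdb_hash(key: bytes) -> int:
    """32-bit hash from the cdb format notes (sketch)."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


class TwoStageTable:
    """In-memory model: 256 small open-addressed tables over one record list."""

    def __init__(self, items: Dict[bytes, bytes]) -> None:
        self.records: List[Tuple[bytes, bytes]] = list(items.items())
        # Stage 1: group record indices by the low 8 bits of the hash.
        groups: List[List[Tuple[int, int]]] = [[] for _ in range(256)]
        for idx, (key, _) in enumerate(self.records):
            h = cdb_hash(key)
            groups[h & 0xFF].append((h, idx))
        # Stage 2: one table per group, sized at twice its entry count
        # (an assumption here) so an empty slot always exists.
        self.tables: List[List[Optional[Tuple[int, int]]]] = []
        for group in groups:
            nslots = 2 * len(group)
            slots: List[Optional[Tuple[int, int]]] = [None] * nslots
            for h, idx in group:
                i = (h >> 8) % nslots
                while slots[i] is not None:  # linear probing
                    i = (i + 1) % nslots
                slots[i] = (h, idx)
            self.tables.append(slots)

    def get(self, key: bytes) -> Optional[bytes]:
        h = cdb_hash(key)
        slots = self.tables[h & 0xFF]
        if not slots:
            return None
        i = (h >> 8) % len(slots)
        for _ in range(len(slots)):
            entry = slots[i]
            if entry is None:  # reached an empty slot: key is absent
                return None
            stored_hash, idx = entry
            if stored_hash == h and self.records[idx][0] == key:
                return self.records[idx][1]
            i = (i + 1) % len(slots)
        return None


table = TwoStageTable({b"one": b"1", b"two": b"2"})
print(table.get(b"one"), table.get(b"missing"))
```

In this model, moving to a 64-bit variant would mostly mean widening the stored positions, which fits the point that the core algorithm is simple.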


zandorg 6236 days ago

Sorry, yes, I didn't look at the code. I was thinking of early Berkeley DB code.
