Hacker News new | ask | show | jobs
by cypherpunks01 5141 days ago
What are people using this 'cdb' for, irl? Sounds interesting but this is the first i've heard of it.
7 comments

It's useful if you have data that is easy to cache (i.e., rebuilt every 6 hours) but very commonly accessed. Because the lookups are so quick (two seeks) it's almost raw disk speed. But yeah, rebuilding the files is an offline process (build new file and swap it in using a rename), so your data has to be cache-friendly.

It's a good alternative to memcache if your data is larger than what memcached can support in RAM.

In the early 2000s I used it to implement most of the frontend for a PPC marketplace for search engines. Held up well. These days I'd just use memcached or redis.

It's a good alternative to memcache if your data is larger than what memcached can support in RAM.

Unless you are running memcached on an ec2 large instance (8gb) or bigger:

"No random limits: cdb can handle any database up to 4 gigabytes."

It's pretty easy to throw together a variant using longs for position instead of unsigned ints, the rest of the code stays the same. Slightly more overhead in the file but as long as the items you're storing are bigger than a few bytes it's not a huge deal.

Anyways, it's useful for stuff where you want to ship out a big dictionary once a day or so and you need fast lookup but it doesn't have to be updated transactionally.

I use it on my mail relay to determine whether a domain is local or not. As this check happens for every incoming RCPT TO (including all the bots checking to see whether I'm an open relay or not), I don't want to hit the database where the accounts are stored just yet.

This is the perfect case for a cdb file: the list of local domains changes very rarely, but lookups happen 100s of times per second.

Granted though, it's a very special case. Aside of this mail case, I've yet to find another real world application.

It's a persistent data structure optimized for static lookup tables. Bernstein's qmail and djbdns use it for mail routing and DNS zone data, respectively.
I'm curious to know how it compares to LevelDB (http://code.google.com/p/leveldb/). I'd love to see a detailed comparison between the two. LevelDB is mutable, but is built on top of an immutable key/value map: http://code.google.com/p/leveldb/source/browse/include/level...
LevelDB is a sorted string table; cdb is a hash table. Without knowing anything more about the implementation, I expect that arbitrary lookups are slower in LevelDB than cdb.
It's used extensively in djb's packages like qmail and ucspi-tcp. In the latter, it's used to maintain a database of blacklisted IPs which are blocked from connecting to a TCP server, among other things. http://cr.yp.to/cdb.html
I used it in combination with the public dumps from IMDB to make a quick on-disk lookup database for use with my favorite (albeit slower) scripting language python.

I didn't have the RAM for holding it all in memory and I was only ever going to do read lookups on it.

It made it very easy to query things like: Given my favorite set of movies, which actors appearing in them have also appeared in the Doctor Who television series?

Just for fun / learning :)

Incidentally, I've heard the IMDB site itself uses cdbs heavily to serve static content and data.
DJB's qmail (the security-hardened sendmail replacement) uses it.