by rdtsc 4607 days ago
Howard, is LMDB effectively limited to 128TB on 64-bit machines (and 2GB on 32-bit ones, not that one should be running large databases on 32-bit machines)?

Also, what about concurrent writes? Does it have a database-wide writer lock, or is it per key (or per page)?

It is limited to the logical address space. Since most current x86-64 machines have only a 48-bit address space (256TB), and assuming the kernel keeps half of the space for itself, then yes, the current limit is 128TB. But I suspect we'll be seeing 56-bit address spaces fairly soon.
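
For anyone who wants to check the arithmetic, here is a quick sketch. The helper name `addressable_tib` is made up for illustration, and the "kernel keeps half" split is the same assumption as in the comment above, not a property of every OS.

```python
# Back-of-the-envelope check of the address-space limits quoted above.
TIB = 2**40  # one tebibyte, in bytes
GIB = 2**30  # one gibibyte, in bytes


def addressable_tib(bits: int) -> int:
    """Size of a virtual address space with `bits` address bits, in TiB."""
    return 2**bits // TIB


total_48 = addressable_tib(48)       # whole 48-bit space
user_48 = total_48 // 2              # assuming the kernel reserves half
user_56 = addressable_tib(56) // 2   # same assumption for a 56-bit space
user_32_gib = 2**32 // 2 // GIB      # 32-bit case, in GiB

print(total_48, user_48)   # 256 128
print(user_56)             # 32768 (TiB, i.e. 32 PiB)
print(user_32_gib)         # 2
```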

It is a single-writer DB, one DB-wide writer lock. Fine-grained locking is a tar pit.

Fine-grained locking is hard, but "tar pit" is unfair and honestly a bad attitude. It's crucial for modern applications, and it can be done if you're careful, and it can be done really well.

We (Tokutek) tried for a long time to get by with a big monolithic lock, and a) competing with InnoDB was really hard since they do concurrent writers really really well, and b) when we did decide to break up the lock, it wasn't as hard as we thought it would be and it worked really really well.

Don't get discouraged, break up that lock!

In our own workloads, writers are always going after the same pages in their index updates, which inevitably led to deadlocks in BerkeleyDB. As a result, we get much higher throughput with fully serialized writers than with "concurrent" writers. A microbench might show greater concurrency on simple write tasks, but in a real live system with elaborate schema, there's no payoff for us.

As always, you have to profile your workload and see where the delays and bottlenecks really are. Taking a single mutex instead of continuously locking/unlocking all over the place was a win for us.
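
To make the "single mutex" idea concrete, here is a toy sketch in Python. It is not LMDB's implementation: `SingleWriterStore` and its methods are hypothetical names, and copying the whole dict on every write stands in for a copy-on-write tree that would copy only the pages it touches. The point is only that with one store-wide writer lock there is no lock ordering between writers to get wrong, while readers pick up a snapshot without locking.

```python
import threading
from types import MappingProxyType


class SingleWriterStore:
    """Toy key-value store: one store-wide writer lock, snapshot reads."""

    def __init__(self):
        self._writer_lock = threading.Lock()   # the single store-wide writer lock
        self._root = MappingProxyType({})      # current read-only snapshot

    def snapshot(self):
        # Readers take no lock: they just grab the current root reference.
        return self._root

    def write(self, updates: dict) -> None:
        # Writers are fully serialized, so two writers can never deadlock
        # on each other's page or key locks.
        with self._writer_lock:
            new_root = dict(self._root)        # copy, never mutate in place
            new_root.update(updates)
            self._root = MappingProxyType(new_root)  # publish the new snapshot


store = SingleWriterStore()
before = store.snapshot()


def worker(i: int) -> None:
    for j in range(100):
        store.write({f"k{i}-{j}": j})


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

after = store.snapshot()
print(len(before), len(after))  # 0 400
```

The `before` snapshot still sees an empty store after all the writes, which is the property that lets readers ignore writers entirely.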

Is this the reason for your observation that LMDB is oriented towards read workloads?

I can see how the extra locking/concurrency code would expand the library size beyond the CPU cache, though.

Yes, since readers don't require any locks at all and don't issue any blocking calls of any kind - syscalls, malloc, whatever - they run completely unimpeded. The moment you introduce fine-grained locks of any kind, the overall performance (reads and writes) will decrease by at least an order of magnitude because readers will have to deal with lock contention.

Makes sense.

Most impressive about LMDB to me is the zero-copy model for readers, with no extra memcpy needed. Maybe that is obvious to database gurus, but it is a pretty clever trick, I think.
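
For readers unfamiliar with the idea, here is a rough illustration of zero-copy reads over a memory-mapped file, using only Python's standard library. It is a sketch of the general technique, not of LMDB's API: the file name, the record layout and the byte offsets are all invented for the example.

```python
import mmap
import os
import tempfile

# Hypothetical on-disk record, standing in for a database page.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"key1=hello;key2=world;")

# Map the file read-only; the mapping stays valid after the file is closed.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

with memoryview(mapped) as whole:
    with whole[5:10] as value:       # a window into the mapping, no bytes copied
        is_readonly = value.readonly
        seen = value.tobytes()       # explicit copy, made only so we can print it
mapped.close()                       # allowed once every view has been released

print(seen, is_readonly)  # b'hello' True
```

The slice `whole[5:10]` refers directly to the mapped pages, so the reader pays for a copy only if it asks for one.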

It's pretty significant, yes. Eliminating multiple copies of everything got us a 4:1 reduction in memory footprint in OpenLDAP slapd (compared to our BerkeleyDB-based backend). This is another reason we don't spend too much time worrying about data compression and I/O bound workloads - when you've essentially expanded your available space by a factor of 4, you get the same benefits as compression, without wasting any of the memory or CPU time. And when you can fit a 4x larger working set into your space, you find that you need far fewer actual I/Os.

If I can pick your brain for a little, do you think LMDB would be a good option as a back end for time series analysis?

I'm sorry, I'm not familiar enough with the workload to answer that. If you're primarily doing sequential writes, it seems like it could work well for it.