Hacker News new | ask | show | jobs
by rdtsc 4607 days ago
Well LevelDB is already good. And if this improves on it, that's great.

I was looking at embedded key value stores and also found -- HyperLevelDB (from creators of Hyperdex database). They also improved on LevelDB in respect to compaction and locking:

http://hyperdex.org/performance/leveldb/

So now I am curios how it would compare.

Another interesting case optimized for reads is LMDB. That is a small but very fast embedded database at sits at the core of OpenLDAP. That one has impressive benchmarks.

http://symas.com/mdb/microbench/

(Note: LMDB used to be called MDB, you might know it by that name).

2 comments

The LMDB statistics are very strange - why is synchronous SSD performance worse on most figures than HDD performance? Something seems very wrong with these benchmarks:

    Section 5 (SSD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB              342 ops/sec	
    Kyoto TreeDB          67 ops/sec	
    SQLite3              114 ops/sec	
    MDB                  148 ops/sec	
    MDB, no MetaSync     322 ops/sec	
    BerkeleyDB           291 ops/sec	
    
    Section 8 (HDD) F (Synchronous Writes)
    
    Random Writes
    
    LevelDB             1291 ops/sec	
    Kyoto TreeDB          28 ops/sec	
    SQLite3              112 ops/sec	
    MDB                  297 ops/sec	
    BerkeleyDB           704 ops/sec	
    
Really? LevelDB is four times faster on an HDD than an SSD with synchronous writes? BerkeleyDB is over twice as fast?

This smells.

I would guess the answer is SSD Write Amplification. SSDs in order to write have to erase first. They also try to minimize wear so internally they spread the data around as it gets written. Maybe someone else with more experience can explain more, but that's my guess.
Keep in mind, the HDD was using ext2 and the SSD was using reiserfs. Synchronous writes on ext2 are faster than all journaling filesystems.
Not three orders of magnitude faster, which is the difference between hdd and ssd random writes.
All of the source code is available on that page, you're welcome to rerun it on your own hardware configuration.
Three orders of magnitude faster would mean 1000x faster. You probably meant 3 times faster.
SSDs really are 1000x faster at random writes (~200,000 iops vs ~200 iops)
> The LMDB statistics are very strange - why is synchronous SSD performance worse on most figures than HDD performance?

Could it be that most database engines are based on algorithms that were developed before SSDs were significant, and were extremely optimized for HDD performance?

No. Most database engines are actually so old they were developed even before HDD hit the seek wall.
Paging hyc_symas. Howard Chu isn't shy about talking about benchmark results and he can be found here and on twitter.
Hello hello! Pretty sure what actually happened in these is that the HDD's internal cache was still active, while the Crucial M4 SSD has no internal cache. The only other explanation is that I screwed up my partition offset on the SSD but I already double-checked that and the partitions were all 2MB aligned.
I'm going to compile up LMDB and bench it on a 96GB DL380g8 with quad 3TB ioDrive-2s. Should be interesting to see how various database sizes play out, and what the write amp looks like. I am not seeing much about LMDB's NUMA awareness -- guess I need to keep digging.
For reads we get linear scaling out to 64 cores. Using cache-aligned data structures plays a big part in that for NUMA. (At the moment that's the largest machine we have in our lab.) For writes, there's basically no scaling. Write amplification is logN, proportional to tree height.
If HDD's internal cache was active how are synchronous writes still faster? Shouldn't the flush/sync ensure persistence of data?

If the cache was being used, the HDD results are not actually synchronous, a power loss event would result in data loss.

I'm not sure but I think HDD are better at sequential operations than SSD (which performs better at random operations). Some people say MySQL performs better on a high speed HDD than on a SSD.
LSMs have a long long way to go to catch up to LMDB. http://symas.com/mdb/hyperdex/
For reads, sure. LSMs are optimized for writes, while LMDB, which is a nice B-tree implementation is optimized for reads.

LSMs are getting popular because it's harder to scale durable writes than reads, which can be handled (in many cases independently) by caching.

LSMs are solving a problem that is rapidly becoming irrelevant due to the multiple NVRAM technologies entering the market. With NVDIMMs, MRAM, FeRAM, PRAM, etc., all your writes can be durable for free. And if you'll notice in that HyperDex benchmark I posted, the LSM write performance was still worse than LMDB, while wasting several times more CPU.

Going forward, power efficiency will still be crucial - for as long as civilization persists. But optimizing durable writes will be about as useful as optimizing delay loops.

Howard, is LMDB effectively limited to 128T (on 64bit machines and 2GB on 32bit ones, not that one should be running large databases on 32bit machines)?

Also what about concurrent writes? Does it have a database wide writer lock or is it per key (per page?) ?

It is limited to the logical address space. Since most current x86-64 machines have only 48bit address space, 256TB, and assuming the kernel keeps half of the space for itself, then yes, the current limit is 128TB. But I suspect we'll be seeing 56bit address spaces fairly soon.

It is a single-writer DB, one DB-wide writer lock. Fine-grained locking is a tar pit.

Fine-grained locking is hard, but "tar pit" is unfair and honestly a bad attitude. It's crucial for modern applications, and it can be done if you're careful, and it can be done really well.

We (Tokutek) tried for a long time to get by with a big monolithic lock, and a) competing with InnoDB was really hard since they do concurrent writers really really well, and b) when we did decide to break up the lock, it wasn't as hard as we thought it would be and it worked really really well.

Don't get discouraged, break up that lock!

In our own workloads, writers are always going after the same pages in their index updates, which inevitably led to deadlocks in BerkeleyDB. As a result, we get much higher throughput with fully serialized writers than with "concurrent" writers. A microbench might show greater concurrency on simple write tasks, but in a real live system with elaborate schema, there's no payoff for us.

As always, you have to profile your workload and see where the delays and bottlenecks really are. Taking a single mutex instead of continuously locking/unlocking all over the place was a win for us.

Is this the reason for your observation that LMDB is oriented towards read workloads?

I can see how the extra code locking/concurrency code would expand the library size out of the CPU cache, though.

Makes sense.

Most impressive about LMDB to me is the zero-copy model for readers, with is no extra memcpy needed, maybe that is something obvious for database gurus but it is pretty clever trick I think.

It's pretty significant, yes. Eliminating multiple copies of everything got us a 4:1 reduction in memory footprint in OpenLDAP slapd (compared to our BerkeleyDB-based backend). This is another reason we don't spend too much time worrying about data compression and I/O bound workloads - when you've essentially expanded your available space by a factor of 4, you get the same benefits of compression, without wasting any of the memory or CPU time. And when you can fit a 4x larger working set into your space, you find that you need a lot less actual I/Os.
If I can pluck your brain for a little, do you think LMDB would be a good option as a back end for time series analysis?