Hacker News new | ask | show | jobs
by hosay123 4825 days ago
I'm not sure it's fair to call this benchmark unfair.. LMDB enables use cases that simply aren't possible with non-zero-copy technologies. For example imagine a database full of large objects, and an HTTP/1.0 server that supports an API returning offsets from within these objects.

In a situation like this it is possible (and efficient) to say, store 500mb-sized files in LMDB and have the first and only copy of file content be done by the kernel as it faults and copies relevant portions of file data directly into a skbuff (e.g. by passing (value_pointer + offset) directly to write(), etc.).

Another use case is streamy decompression. Why read or fault an entire value into memory if the reading client disconnects after consuming only the first kilobyte?

In any database where copying the entire record value occurs first, designs like the above simply aren't feasible.

1 comments

These are very niche use cases. Doing an entire benchmark on that basis without making that clear is misleading. And even in that exact use case, you will not see the massive performance difference that they are showing, because even in that case the data will have to be read from disk by the kernel, whereas in their benchmark the disk isn't even touched. Sure, in a database that doesn't support this, it might read the data from disk to memory, and then write it to a buffer, and that's going to be a bit less efficient than letting the kernel do it directly. You have an extra in memory copy, but that's still far from the performance difference that they are showing between disk read vs no disk read at all. So even in this ideal scenario, they are still massively overstating their performance (which should already be clear from the fact that no disk has 3 terabytes per second throughput). Designs like that are absolutely feasible in any database. You just have to do your own chunking. Storing 500MB records in a database is a bad idea anyways, since you never know if you're going to trip over some logic in the database, your application, or the kernel, or some interaction between these, that's going to load or copy it all unnecessarily. Better to make the chunking explicit.
Better to make everything as simple and as easy as possible. App-managed chunking is error-prone, and it forces redundancy of implementations. Likewise, "you have an extra in memory copy" can make or break a system. OpenLDAP slapd using LMDB uses 1/4 the RAM of slapd using BerkeleyDB for a given number of entries. That means you can serve 4x as many entries using LMDB on a given hw config at full RAM speed, vs BDB. LMDB efficiency has a direct impact on hardware cost and electricity cost, as well as sw development cost.
Exactly! So do real world fair benchmarks that show 4x improvement, not unfair benchmarks that literally show 3000x improvement over BDB.
We are doing all of the above. Microbenchmarks that show only the performance of the DB engine still serve a purpose. And the difference between the microbenches and the full benches shows you how much overhead is eaten in the rest of the system. E.g., when testing read performance of SQLite3 adapted to LMDB, we measured only a 1-2% improvement, because the majority of SQLite3 read execution time is spent in SQL processing, not in its Btree code.
Here's some more real world testing, done by the folks at Zimbra:

http://wiki.zimbra.com/wiki/OpenLDAP_MDB_vs_HDB_performance

Clearly more than 4x write speed improvement.

Nice! Do you know why that benchmark show almost no read improvement but massive write improvement? The slides here (http://symas.com/mdb/20120322-UKUUG-MDB.pdf) suggest the opposite? Or am I reading them incorrectly?
As I understand it, that test used something like 32 server threads on a machine with only 4 cores. The UKUUG prezo used 16 server threads on a machine with 16 cores, which allowed all of the reads to run concurrently. The smaller number of CPU cores in the Zimbra test naturally reduces the concurrency advantage of LMDB.
I believe the test runs with 100 byte values, and the key is stored on the same page as the value. Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies. Note the library's tiny code size (and zero copy design) has another benefit, in that it can greatly reduce cache pollution, which may again help account for the speed difference.

> Storing 500MB records in a database is a bad idea anyways, since you never know if you're going to trip over some logic in the database, your application, or the kernel, or some interaction between these, that's going to load or copy it all unnecessarily. Better to make the chunking explicit.

So basically you're saying there is no benefit because you aren't used to systems that support it.

> Therefore in order to perform a lookup the entire page must still be faulted in, and the key must still be compared, so the benchmark is not fake in that regard. Data is getting pulled off disk, just not in a way that requires multiple copies.

How did you get this information? If that's the case, why did they not actually use the values (e.g. do something trivial like compute accumulated XOR or something). That would still show the advantages of the zero-copy design, but not ridiculously so.

It still seems highly suspicious. For example the benchmark with 100kb values runs twice as fast as the benchmark with 100 byte values?!

I certainly agree that the zero-copy design can be good, but these benchmarks are vastly overstating it to the point that the benchmarks are pretty useless. With these benchmarks it's impossible to know if that database is actually faster than the others. It could well be faster, but it could also be slower. To know that you need a fair comparison.

Note also that memory mapped designs have disadvantages as well. For example AFAIK memory mapping doesn't work asynchronously. You will block your threads, so you can't use lightweight threads or callbacks but you'll have to use heavyweight threads (which may limit concurrency and be expensive in general). Unless you implement your own OS of course (which interestingly some people are doing: http://www.openmirage.org/).

> So basically you're saying there is no benefit here because you aren't used to systems that effortlessly support it.

I'm not saying there is no benefit, I'm saying that it's a bad idea to rely on it that way for 500mb because for it to work safely every component has to support the same. Too easy to have an unexpected problem somewhere and have it blow up by copying 500mb when you only wanted 1kb. But it's still valuable for a database to handle large values well of course. Anyway, this is beside the main point.

Still nonsense. "AFAIK memory mapping doesn't work asynchronously" - clearly you don't know much. Reader threads in LMDB never block (except for pagefaults to pull data in, which are obviously unavoidable).
> except for pagefaults to pull data in, which are obviously unavoidable

Umm, obviously that's what I mean. And with async IO instead of memory mapping, you could avoid that. Hence why I said memory mapping has disadvantages. Clearly the "nonsense" was invented at your end.

Eh. You're still talking nonsense. If an app requests some data and it needs to be read in from disk, you have to wait. Async I/O or not, that app can't make any progress until its data arrives. Async I/O is only advantageous for writes.
I use it in a private project. For comparison, here's the difference between touch-all-bytes and touch-no-bytes using the Python binding:

    random lookup 1000000 buffers in 1.82sec (550928/sec)
    random lookup+hash 1000000 buffers in 2.52sec (397189/sec)
This was using 100 byte values.
Could it be that your test is dominated by Python speed, given that their tests show 15-30 million operations per second but your test only half a million? Or is their hardware so much faster?
> Doing an entire benchmark on that basis without making that clear is misleading.

A benchmark which provides source code and test data should never be categorized as "misleading." At worst it puts too much trust in the average reader's understanding of things. But look at it this way: if you littered every page that included a benchmark with all the relevant caveats, it'd be a total mess.

Or as David Simon says, "fuck the average viewer."