by hyc_symas 4675 days ago
Yes, I changed the key format to allow using the MDB_APPEND option for bulk loading. (That's only usable in LMDB for sequential inserts.) Otherwise, for random inserts, things will be much slower. (Again, refer to the microbench to see the huge difference this makes.) If you don't have your data ordered in advance then this comparison is invalid, and we'd have to just refer to the much slower random insert results.
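To illustrate why the key format matters here: MDB_APPEND expects keys to arrive already sorted under the database's comparison function, which by default is byte-wise lexicographic. The sketch below is not the benchmark's code. The two key formats (`key_%u` and a zero-padded `key_%010u`) are assumptions made up for illustration, since the actual formats are not quoted in this thread. It only checks whether sequentially generated keys come out in ascending byte order.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical key formats: the real benchmark formats are not shown here. */

/* Unpadded decimal: "key_10" sorts BEFORE "key_9" in byte order. */
static void make_key_plain(char *buf, size_t len, unsigned n) {
    snprintf(buf, len, "key_%u", n);
}

/* Zero-padded to a fixed width: byte order matches numeric order. */
static void make_key_padded(char *buf, size_t len, unsigned n) {
    snprintf(buf, len, "key_%010u", n);
}

/* Returns 1 if keys 0..count-1 are strictly ascending under strcmp,
 * which agrees with byte-wise lexicographic order for NUL-free keys. */
static int keys_ascending(void (*make_key)(char *, size_t, unsigned),
                          unsigned count) {
    char prev[32], cur[32];
    make_key(prev, sizeof prev, 0);
    for (unsigned i = 1; i < count; i++) {
        make_key(cur, sizeof cur, i);
        if (strcmp(prev, cur) >= 0)
            return 0;               /* out of order: unsuitable for appending */
        memcpy(prev, cur, sizeof prev);
    }
    return 1;
}
```

With these assumed formats, `keys_ascending(make_key_padded, 1000)` holds while `keys_ascending(make_key_plain, 1000)` fails at the step from `key_9` to `key_10`.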

Still don't understand what happened to sparkey at 100M. The same thing happens using snappy, and the compressed filesize is much smaller than LMDB's, so it can't be pagecache exhaustion.

Also suspicious of the actual time measurements. Both of these programs are single-threaded, so there's no way the CPU time measurement should be greater than the wall-clock time. I may take a run at using getrusage and gettimeofday instead; these clock_gettime results look flaky.
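A minimal sketch of what measuring with getrusage and gettimeofday could look like, assuming a POSIX system. This is not the code from either benchmark repo. The names `struct timing`, `time_call`, and `busy_loop` are hypothetical helpers invented for the example.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Hypothetical result type for one timed call. */
struct timing {
    double wall_sec;   /* elapsed wall-clock time */
    double cpu_sec;    /* user + system CPU time of this process */
};

static double tv_to_sec(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Wall-clock time from gettimeofday (not monotonic, microsecond units). */
static double wall_now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv_to_sec(tv);
}

/* CPU time consumed so far by this process, user plus system. */
static double cpu_now(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return tv_to_sec(ru.ru_utime) + tv_to_sec(ru.ru_stime);
}

/* Runs fn(arg) once and reports wall and CPU time spent in it. */
static struct timing time_call(void (*fn)(void *), void *arg) {
    struct timing t;
    double w0 = wall_now();
    double c0 = cpu_now();
    fn(arg);
    t.cpu_sec = cpu_now() - c0;
    t.wall_sec = wall_now() - w0;
    return t;
}

/* Stand-in workload: arg points to an unsigned long iteration count. */
static void busy_loop(void *arg) {
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < n; i++)
        sink += i;
}
```

For a single-threaded workload, `cpu_sec` should come out at or below `wall_sec`, give or take the granularity of the two clocks.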

2 comments

Could be due to a bug related to reading uninitialized data on the stack. That could lead to using the wrong number of bits for the hash, causing an unnecessarily high number of hash collisions, which makes it more expensive because the false positives need to be verified. I think it's fixed in the latest master, and the benchmark code now prints the number of collisions per test case, which could be useful debug data.
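To show how strongly the number of hash bits affects the collision count, here is a toy sketch. It is not sparkey's hash or table layout. `mix64` is the splitmix64 mixing step used as a stand-in hash, and `count_collisions` is a hypothetical helper that counts keys whose truncated hash matches an earlier key.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in hash (splitmix64 mixing step); NOT the hash sparkey uses. */
static uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Hashes keys 0..n-1, keeps only the low `bits` bits (1..24), and counts
 * how many keys land on a value some earlier key already produced.
 * Returns (size_t)-1 if the bitmap cannot be allocated. */
static size_t count_collisions(size_t n, unsigned bits) {
    size_t slots = (size_t)1 << bits;
    uint64_t mask = (uint64_t)slots - 1;
    unsigned char *seen = calloc(slots, 1);
    size_t collisions = 0;

    if (!seen)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t slot = (size_t)(mix64((uint64_t)i) & mask);
        if (seen[slot])
            collisions++;   /* would need a full key comparison to resolve */
        else
            seen[slot] = 1;
    }
    free(seen);
    return collisions;
}
```

With 1000 keys and only 8 bits there are at most 256 distinct hash values, so at least 744 keys collide; with 20 bits collisions become rare. Each collision stands for a candidate match that has to be checked against the real key.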

Also, I think it would be more interesting to see a comparison with lmdb using random writes instead of sequential.

As for the CPU time measurement, the wall clock is very imprecise, so it could be some small quantum larger than CPU time, but it should never be more than the system-specific wall-clock quantum.
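One way to put a number on that quantum, assuming a POSIX system, is clock_getres. The sketch below reads the reported resolution of a clock and checks whether a CPU time exceeds a wall time by no more than one quantum. `clock_resolution_sec` and `cpu_within_quantum` are hypothetical helper names, and the reported resolution is not guaranteed to match the effective granularity of every timing call.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Reported resolution of the given clock in seconds, or -1.0 on error. */
static double clock_resolution_sec(clockid_t id) {
    struct timespec res;
    if (clock_getres(id, &res) != 0)
        return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec / 1e9;
}

/* Returns 1 if cpu_sec exceeds wall_sec by at most one quantum,
 * i.e. the discrepancy is explainable by clock granularity alone. */
static int cpu_within_quantum(double wall_sec, double cpu_sec,
                              double quantum_sec) {
    return cpu_sec <= wall_sec + quantum_sec;
}
```

For example, `cpu_within_quantum(wall, cpu, clock_resolution_sec(CLOCK_REALTIME))` failing on a single-threaded run would point at a measurement problem rather than clock granularity.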

re: random insert order - if we just revert to the original key format you'll get this: http://www.openldap.org/lists/openldap-devel/200711/msg00002... It becomes a worst-case insert order. If you want to do an actual random order, with a shuffled list so there are no repeats, you'll get something like the September 2012 LMDB microbench results. If you just use rand() and don't account for duplicates you'll get something like the July 2012 LMDB microbench results.
(I've updated my repo using gettimeofday/getrusage).
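On the difference between a shuffled list and plain rand(): a rough sketch, not the microbench code, of the two ways to generate a random insert order. `fill_shuffled`, `fill_rand`, and `count_duplicates` are hypothetical names. The modulo on rand() has a slight bias, and the code assumes n does not exceed RAND_MAX, which is fine for illustration.

```c
#include <stdlib.h>

/* Fisher-Yates shuffle of 0..n-1: every key index appears exactly once. */
static void fill_shuffled(unsigned *keys, size_t n) {
    for (size_t i = 0; i < n; i++)
        keys[i] = (unsigned)i;
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;      /* pick from the first i slots */
        unsigned tmp = keys[i - 1];
        keys[i - 1] = keys[j];
        keys[j] = tmp;
    }
}

/* Independent draws in 0..n-1: some keys repeat, others never appear. */
static void fill_rand(unsigned *keys, size_t n) {
    for (size_t i = 0; i < n; i++)
        keys[i] = (unsigned)((size_t)rand() % n);
}

/* Counts entries whose value already appeared earlier in the array.
 * All values must be < n. Returns (size_t)-1 on allocation failure. */
static size_t count_duplicates(const unsigned *keys, size_t n) {
    unsigned char *seen = calloc(n ? n : 1, 1);
    size_t dups = 0;

    if (!seen)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        if (seen[keys[i]])
            dups++;                 /* this insert hits an existing key */
        else
            seen[keys[i]] = 1;
    }
    free(seen);
    return dups;
}
```

The shuffled order always yields n distinct keys, while the rand() order inserts fewer distinct keys, so the two runs do not measure the same workload.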