Hacker News new | ask | show | jobs
by hyc_symas 1088 days ago
You don't need more than single-writer concurrency if your write txns are fast enough.

Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.

2 comments

> You don't need more than single-writer concurrency if your write txns are fast enough.

This only works on systems with sufficiently slow storage. If your server has a bunch of NVMe, which is a pretty normal database config these days, you will be hard-pressed to get anywhere close to the theoretical throughput of the storage with a single writer. That requires 10+ GB/s sustained. It is a piece of cake with multiple writers and a good architecture.

Writes through indexing can be sustained at this rate (assuming appropriate data structures), most of the technical challenge is driving the network at the necessary rate in my experience.

That's all just false. Just because you're single-writer at the application level doesn't mean the OS isn't queueing enough writes to saturate storage at the device level. We've benchmarked plenty of high speed NVMe devices, like Intel Optane SSDs, etc. showing this. http://www.lmdb.tech/bench/optanessd/
this guy is a fool. Ignore him. Or see his website.
That's probably because your OpenLDAP benchmarks used a tiny database. If you have multi-terabyte databases, you will start to see huge gains from a multi-writer setup because you will be regularly be loading pages from disk, rather than keeping almost all of your btree/LSM tree in RAM.
Yeah, no. Not with a DB 50x larger than RAM, anyway.

http://www.lmdb.tech/bench/hyperdex/

RAM is relatively cheap too, there's no real reason to be running multi-TB databases at greater than a 50x ratio.

Sorry, but "50x larger than RAM" is a pretty small DB - that's an 800 GB database on a machine with 16 GB of RAM. I usually have seen machines with 500-1000x ratios of flash to RAM. "RAM is relatively cheap" is also false when you're storing truly huge amounts of data, which is how the systems you compare yourself to (LevelDB, etc) are usually deployed. Note that RAM is now the single greatest cost when buying servers.

> Now that the total database is 50 times larger than RAM, around half of the key lookups will require a disk I/O.

That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector. I am also a little surprised at the lack of a random writes benchmark. I get that this is marketing material, though.

> I am also a little surprised at the lack of a random writes benchmark.

Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

> That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector.

No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.

> Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

> No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.

It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."

> The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

OK, I see your point. It would only have made things even worse for LevelDB here to do an Add/Delete workload because its garbage compaction passes would have had to do a lot more work.

> It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."

This was 100 million records, and a 5-level deep tree. To get to 6 levels deep it would be about 10 billion records. Most of the branch pages would still fit in RAM; most queries would require at most 1 more I/O than the 5-level case. The cost is still better than any other approach.