Hacker News
by isotopp 1827 days ago
With NVMe you can get around 800,000 IOPS from a single device, but with only one request in flight at a time, the per-request latency limits you to around 20,000 IOPS. You need to talk with deep queues or with multiple concurrent threads to the device in order to eat the entire IOPS buffet.

Traditional OLTP workloads do not tend to have the concurrency to actually saturate an NVMe device. You would need to be 40-way parallel, but most OLTP workloads give you 4-way.
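
The 40-way figure follows from Little's law: requests in flight = throughput × latency. A minimal sketch of that arithmetic, using the round numbers from the post (the function name is only for illustration):

```python
def required_concurrency(device_iops: float, serial_iops: float) -> tuple[float, float]:
    """Return (per-request latency in microseconds, requests that must be in flight)."""
    # One request at a time at 20,000 IOPS means each request takes 1/20,000 s.
    latency_s = 1.0 / serial_iops
    # Little's law: concurrency = throughput * latency.
    in_flight = device_iops * latency_s
    return latency_s * 1e6, in_flight


latency_us, in_flight = required_concurrency(800_000, 20_000)
print(f"latency ~{latency_us:.0f} us, need ~{in_flight:.0f} requests in flight")
```

With 4-way parallelism the same formula gives roughly 80,000 IOPS, about a tenth of what the device can deliver.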

Multiple instances per device are almost a must.

2 comments

With a lot of NVMe devices, up to medium priced server gear, the bottleneck in OLTP workloads isn't normal write latency, but slow write cache flushes. On devices with write caches one either needs to fdatasync() the journal on commit (which typically issues a whole device cache flush) or use O_DIRECT | O_DSYNC (ending up as a FUA write which just tags the individual write as needing to be durable) for journal writes. Often that drastically increases latency and slows down concurrent non-durable IO, reducing the benefit of deeply queued IO substantially.
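
As a rough illustration of the two commit paths, here is a sketch assuming Linux and Python's `os` module. The helper names are made up for this example. O_DIRECT is left out because it requires aligned buffers, so only the O_DSYNC half of the flag combination is shown.

```python
import os


def write_all(fd: int, data: bytes) -> None:
    """os.write may write fewer bytes than requested, so loop until done."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def open_journal(path: str, dsync: bool = False) -> int:
    """Open an append-only journal; with dsync=True every write is itself durable."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if dsync:
        flags |= os.O_DSYNC  # each write returns only once the data is durable
    return os.open(path, flags, 0o600)


def commit_with_fdatasync(fd: int, record: bytes) -> None:
    """Path 1: plain write, then an explicit flush of the file's data on commit."""
    write_all(fd, record)
    os.fdatasync(fd)


def commit_with_dsync(fd: int, record: bytes) -> None:
    """Path 2: fd was opened with O_DSYNC, so the write itself carries the durability."""
    write_all(fd, record)
```

Whether the second path really turns into a FUA write depends on the kernel, filesystem and device; the code only shows where the durability request is expressed.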

On top-line gear this isn't an issue: these devices don't signal a volatile write cache (by virtue of either having a non-volatile cache or enough of a power reserve to flush the cache), which lets the OS skip the more expensive flush handling for fdatasync()/O_DSYNC. One can also manually tell the kernel to ignore the need for cache flushes by changing /sys/block/nvme*/queue/write_cache to say write through, but that obviously loses durability guarantees. It can still be useful for testing on lower-end devices.
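
To see what the kernel currently believes about each device, one can read that sysfs attribute. A small sketch, assuming Linux; the function name and the `sys_block` parameter are only for illustration (the parameter makes it possible to point the function at a different directory):

```python
import glob
import os


def write_cache_modes(sys_block: str = "/sys/block") -> dict[str, str]:
    """Map each nvme* block device to its write_cache setting."""
    modes = {}
    pattern = os.path.join(sys_block, "nvme*", "queue", "write_cache")
    for path in sorted(glob.glob(pattern)):
        # path looks like <sys_block>/<device>/queue/write_cache
        device = path.split(os.sep)[-3]
        with open(path) as f:
            modes[device] = f.read().strip()  # "write back" or "write through"
    return modes


for device, mode in write_cache_modes().items():
    print(f"{device}: {mode}")
```

Changing the setting means writing the string "write through" to the same file as root, with the loss of guarantees described above.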

One consequence of that is that:

> Multiple instances per device are almost a must.

isn't actually unproblematic in OLTP, because it increases the number of journal writes that need to be flushed. With a single instance, group commit can amortize the write cache flush costs much more efficiently than with many concurrent instances all separately doing much smaller group commits.
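
A toy model of that amortization, under the simplifying assumptions that commits are spread evenly across instances and that each instance with pending commits issues exactly one journal flush per round (the function name is made up for this sketch):

```python
def flushes_per_round(commits_in_flight: int, instances: int) -> int:
    """Every instance that has at least one pending commit needs its own flush."""
    return min(commits_in_flight, instances)


for n in (1, 4, 10):
    flushes = flushes_per_round(40, n)
    print(f"{n:2d} instance(s): {flushes} flush(es) for 40 commits, "
          f"{40 / flushes:.0f} commits per flush")
```

The same 40 concurrent commits cost one flush with a single instance but ten flushes when split across ten instances, so each flush covers far fewer commits.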

> You need to talk with deep queues or with multiple concurrent threads to the device in order to eat the entire IOPS buffet.

Completely agree. There is another angle you can play if you are willing to get your hands dirty at the lowest levels.

If you build a custom database engine that fundamentally stores everything as key-value, and then builds relational abstractions on top, you can leverage a lot more benefit on a per-transaction basis. For instance, if you are storing a KVP per column in a table and the table has 10 columns, you may wind up generating 10-20 KVP items per logical row insert/update/delete. And if you are careful, you can make sure this extra data structure expressiveness does not cause write amplification (single writer serializes and batches all transactions).
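
A minimal in-memory sketch of that idea, not any particular engine's design. The key layout `(table, primary key, column)`, the `SingleWriter` class and the list standing in for the on-disk log are all assumptions made for illustration:

```python
from typing import Any

KV = tuple[tuple[str, Any, str], Any]


def row_to_kvs(table: str, pk: Any, row: dict[str, Any]) -> list[KV]:
    """Explode one logical row into one key-value pair per column."""
    return [((table, pk, column), value) for column, value in row.items()]


class SingleWriter:
    """Collects KV pairs from many transactions and writes them as one batch."""

    def __init__(self) -> None:
        self.pending: list[KV] = []
        self.log: list[list[KV]] = []  # stand-in for the log: one entry per physical write

    def submit(self, kvs: list[KV]) -> None:
        self.pending.extend(kvs)

    def flush(self) -> int:
        """Write everything pending as a single batch; return the number of KV items."""
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        self.log.append(batch)
        return len(batch)


writer = SingleWriter()
columns = [f"c{i}" for i in range(10)]
writer.submit(row_to_kvs("orders", 1, {c: 0 for c in columns}))
writer.submit(row_to_kvs("orders", 2, {c: 0 for c in columns}))
items = writer.flush()
print(f"{items} KV items in {len(writer.log)} physical write(s)")
```

Two 10-column row inserts become 20 KV items but still only one batched write, which is the sense in which the extra items need not cause write amplification.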

> If you build a custom database engine that fundamentally stores everything as key-value, and then builds relational abstractions on top
Sounds like this could be FoundationDB, among other contenders like TiDB.

https://foundationdb.org

More like MyRocks. FoundationDB doesn't use an LSM tree and definitely wants to do lots of overwriting in place. TiDB uses RocksDB and would be closer.
You may want to play with a TiDB setup from PingCAP.