by anarazel
1829 days ago
With a lot of NVMe devices, up to medium-priced server gear, the bottleneck in OLTP workloads isn't normal write latency but slow write cache flushes. On devices with write caches one either needs to fdatasync() the journal on commit (which typically issues a whole-device cache flush) or use O_DIRECT | O_DSYNC for journal writes (ending up as a FUA write, which just tags the individual write as needing to be durable). Often that drastically increases latency and slows down concurrent non-durable IO, substantially reducing the benefit of deeply queued IO. On top-line gear this isn't an issue: those devices don't signal a volatile write cache (by virtue of either having a non-volatile cache or enough of a power reserve to flush the cache), which then keeps the OS from actually doing anything more expensive for fdatasync()/O_DSYNC. One can also manually ignore the need for cache flushes by changing /sys/block/nvme*/queue/write_cache to say "write through", but that obviously loses durability guarantees. It can still be useful for testing on lower-end devices.
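To make the two commit paths concrete, here is a minimal Python sketch (Linux only). The function names and the journal layout are hypothetical and not taken from any particular database. O_DIRECT is deliberately left out because it needs aligned buffers and is not supported on every filesystem, so the second path only shows the O_DSYNC half of O_DIRECT | O_DSYNC. Whether the kernel actually turns these into a cache flush or a FUA write depends on the device and the reported cache mode.

```python
import os


def commit_with_fdatasync(fd, record):
    """Append a journal record, then make it durable with fdatasync().

    On a device that reports a volatile write cache this typically
    ends in a whole-device cache flush.
    """
    os.write(fd, record)
    os.fdatasync(fd)


def open_journal_dsync(path):
    """Open a journal so every write() is durable when it returns.

    Assumption: a real engine would usually add os.O_DIRECT and use
    aligned buffers; that is omitted here to keep the sketch portable.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    return os.open(path, flags, 0o600)


def commit_with_dsync(fd, record):
    """Append a journal record on an O_DSYNC descriptor; no extra sync call."""
    os.write(fd, record)


def read_write_cache_mode(device):
    """Return the cache mode the kernel reports for a block device.

    `device` is a name such as "nvme0n1". The result is the content of
    /sys/block/<device>/queue/write_cache ("write back" or
    "write through"), or None if the file cannot be read.
    """
    path = f"/sys/block/{device}/queue/write_cache"
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None
```

Timing commit_with_fdatasync() against commit_with_dsync() on the same device, once with the cache mode at "write back" and once at "write through", is one way to see how much of the commit latency comes from the flush itself. The second setting gives up durability on devices with a volatile cache, so it is only suitable for testing.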
> Multiple instances per device are almost a must.
That isn't actually unproblematic in OLTP, because it increases the number of journal writes that need to be flushed. With a single instance, group commit can amortize the write cache flush costs much more efficiently than many concurrent instances can when each is separately doing much smaller group commits.
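A toy model of that amortization, as a rough sketch rather than a measurement. The assumptions are mine: commits arrive at fixed times, each instance runs one flush at a time, every flush takes the same fixed time, and a flush covers all commits that have arrived at that instance when it starts. The model also pretends that flushes from different instances don't slow each other down, which flatters the multi-instance case. The names count_flushes and total_flushes are hypothetical.

```python
def count_flushes(arrival_times, flush_time):
    """Count the flushes one instance issues under simple group commit.

    arrival_times: sorted commit arrival times (integer microseconds).
    flush_time: duration of one cache flush (same unit).
    """
    flushes = 0
    now = 0
    i = 0
    n = len(arrival_times)
    while i < n:
        # If idle, wait until the next commit arrives.
        now = max(now, arrival_times[i])
        # Every commit that has arrived by now rides on this flush.
        while i < n and arrival_times[i] <= now:
            i += 1
        flushes += 1
        now += flush_time
    return flushes


def total_flushes(arrival_times, instances, flush_time):
    """Spread commits round-robin over instances and sum their flushes."""
    return sum(
        count_flushes(arrival_times[k::instances], flush_time)
        for k in range(instances)
    )


# 1000 commits, one every 100 microseconds; each flush takes 1000 microseconds.
arrivals = [i * 100 for i in range(1000)]
single = total_flushes(arrivals, 1, 1000)
split = total_flushes(arrivals, 10, 1000)
print(single, split)
```

Under these assumptions the single instance needs 101 flushes for the 1000 commits, since about ten commits share each flush. Ten instances need 1000 flushes, because each instance only ever has one commit waiting when its flush starts.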