| fsync waits for the drive to report back the success write. When you do a ton of small writes, fsync becomes a bottleneck. Its a issue of context switching and pipelining with fsync. When you async write data, you do not need to wait for this confirmation. So by double writing two async requests, you are better using all your system CPU cores as they are not being stalled waiting for that I/O response. Seeing a 10x performance gain is not uncommon using a method like this. Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes. It has literally the same durability as a fsync write. You need to take in account, that most databases are written 30, 40 ... years ago. In the time when HDDs ruled and stuff like NVME drives was a pipedream. But most DBs still work the same, and threat NVME drives like they are HDDs. Doing this above operation on a HDD, will cost you 2x the performance because you barely have like 80 to 120 IOPS/s. But a cheap NVME drive easily does 100.000 like its nothing. If you even monitored a NVME drive with a database write usage, you will noticed that those NVME drives are just underutilized. This is why you see a lot more work in trying new data storage layers being developed for Databases that better utilize NVME capabilities (and trying to bypass old HDD era bottlenecks). |
I don't think we can ensure this without knowing what fsync() maps to in the NVMe standard, and somehow replicating that. Just reading back is not enough, e.g. the hardware might be reading from a volatile cache that will be lost in a crash.