| HN Mirror


by benjiro 337 days ago

fsync waits for the drive to report that the write succeeded. When you do a ton of small writes, fsync becomes a bottleneck. It's an issue of context switching and pipelining with fsync.
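
To make the cost concrete, here is a minimal Python sketch of the two extremes: calling `os.fsync` after every small write versus writing the whole batch and syncing once. The function names and the record format are made up for illustration, and the actual speed difference depends on the drive and filesystem.

```python
import os
import tempfile


def write_fsync_each(path, records):
    """Append each record and wait for fsync before writing the next one."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)
            os.fsync(fd)  # blocks until the kernel reports the flush as done
    finally:
        os.close(fd)


def write_fsync_once(path, records):
    """Append all records, then issue a single fsync for the whole batch."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for rec in records:
            os.write(fd, rec)
        os.fsync(fd)  # one flush covers every write above
    finally:
        os.close(fd)


if __name__ == "__main__":
    records = [b"record-%04d\n" % i for i in range(100)]
    with tempfile.TemporaryDirectory() as tmp:
        write_fsync_each(os.path.join(tmp, "each.log"), records)
        write_fsync_once(os.path.join(tmp, "once.log"), records)
```

Both functions leave the same bytes in the file; the only difference is how many times the caller stalls waiting for the flush.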

When you write data asynchronously, you do not need to wait for this confirmation. So by issuing two async write requests (double writing), you make better use of all your CPU cores, as they are not stalled waiting for the I/O response. Seeing a 10x performance gain is not uncommon with a method like this.

Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes.
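
One way to read the scheme described above: submit the record to two files concurrently, wait for both writes to complete, and only then acknowledge the client. The sketch below uses a thread pool as a stand-in for whatever asynchronous I/O interface is actually meant (the comment does not say); `write_copy`, `double_write`, and the two-file layout are assumptions. A completed `os.pwrite` only means the kernel accepted the data, so the sketch by itself makes no durability claim.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_copy(path, data, offset):
    """Write one copy of the record; returns the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        return os.pwrite(fd, data, offset)
    finally:
        os.close(fd)


def double_write(pool, paths, data, offset=0):
    """Submit both copies at once; True only if both writes completed in full."""
    futures = [pool.submit(write_copy, p, data, offset) for p in paths]
    # result() blocks until that write finishes and re-raises any OSError
    return all(f.result() == len(data) for f in futures)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp, ThreadPoolExecutor(2) as pool:
        paths = [os.path.join(tmp, "copy_a"), os.path.join(tmp, "copy_b")]
        if double_write(pool, paths, b"hello"):
            print("both copies written, acknowledge the client")
```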

It has literally the same durability as a fsync write. You need to take into account that most databases were written 30, 40 ... years ago, at a time when HDDs ruled and NVMe drives were a pipe dream. But most DBs still work the same way and treat NVMe drives like they are HDDs.

Doing the above on an HDD will cost you 2x the performance, because you barely have 80 to 120 IOPS. But a cheap NVMe drive easily does 100,000 like it's nothing.
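
As rough arithmetic, assuming every record costs exactly two write operations and nothing else limits throughput (a simplification, not a benchmark), the ceiling works out like this. `records_per_second` is a hypothetical helper, and 100 IOPS is just the midpoint of the HDD range quoted above.

```python
def records_per_second(device_iops, writes_per_record=2):
    """Upper bound on records/s if each record needs `writes_per_record` I/Os."""
    return device_iops / writes_per_record


if __name__ == "__main__":
    for name, iops in [("HDD", 100), ("NVMe", 100_000)]:
        print(name, records_per_second(iops))
```

Under these assumptions the HDD tops out at 50 records per second, while the NVMe drive still has room for 50,000.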

If you have ever monitored an NVMe drive under database write load, you will have noticed that it is heavily underutilized. This is why you see a lot more work on new storage layers for databases that make better use of NVMe capabilities (and try to bypass old HDD-era bottlenecks).

2 comments

zozbot234 337 days ago

> It has literally the same durability as a fsync write

I don't think we can ensure this without knowing what fsync() maps to in the NVMe standard, and somehow replicating that. Just reading back is not enough, e.g. the hardware might be reading from a volatile cache that will be lost in a crash.


benjiro 337 days ago

Unless you're running cheap consumer NVMe drives, that is not an issue: enterprise SSDs/NVMe drives have their own capacitors to ensure data is always written.

On cheaper NVMe drives, your point is valid. But we also need to ask how much at risk you really are. What is the chance that the system fails in such a way that you just happened to send X confirmations to clients for data that never got written?

Some companies will not cheap out and will spend tons on enterprise-level hardware. But the rest of us? I mean, have you seen the German host Hetzner, where 97% of their hardware is consumer-level? Yes, there is a risk, but nobody complains about that risk.

And frankly, everything can be a risk if you think about it. I have had EXT3 partitions get corrupted on a production DB server. That is why you have replication and backups ;)

TiDB (or was it another distributed DB?) also does not guarantee consistency, if I remember correctly. They trade it for performance and give you eventual consistency.


gpderetta 337 days ago

Forget about consumer FD, unless you are explicitly doing O_DIRECT, why would you expect that a notification that your IO has completed would mean that it has reached the disk at all? The data might still be just in the kernel page buffer and not gotten close to the disk at all.

You mention you need to wait for the completion record to be written. But how do you do that without fsync or O_DIRECT? A notification that the write has completed is not that.

Edit: maybe you are using RWF_SYNC in your write call. That could work.
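
For reference, Linux exposes this flag through `pwritev2`, which Python wraps as `os.pwritev` with `os.RWF_SYNC` (Linux 4.7+, Python 3.7+). A minimal sketch, where `durable_pwrite` is a hypothetical helper name and the fallback path is an assumption for platforms without the flag:

```python
import os
import tempfile


def durable_pwrite(fd, data, offset):
    """Write `data` at `offset` and return only after it has been synced."""
    if hasattr(os, "pwritev") and hasattr(os, "RWF_SYNC"):
        # Per-write equivalent of opening the file with O_SYNC
        return os.pwritev(fd, [data], offset, os.RWF_SYNC)
    # Fallback (assumption): plain write followed by an explicit fsync
    written = os.pwrite(fd, data, offset)
    os.fsync(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(os.path.join(tmp, "wal"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            durable_pwrite(fd, b"commit-1\n", 0)
        finally:
            os.close(fd)
```

Whether the data survives power loss still depends on the drive honoring the flush, which is the point raised earlier about volatile caches.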


codys 337 days ago

> Yes, you do need to check if both records are written and then report it back to the client. But that is a non-fsync request and does not tax your system the same as fsync writes.

What mechanism can be used to check that the writes are complete if not fsync (or adjacent fdatasync)? What specific io_uring operation or system call?
