Hacker News
by quietbritishjim 341 days ago
The article claims that, when they switched to io_uring,

> throughput increased by an order of magnitude almost immediately

But right near the start is the real story: the sync version had

> the classic fsync() call after every write to the log for durability

They are not comparing the performance of sync APIs vs io_uring. They're comparing using fsync vs not using fsync! They even go on to say that a problem with the async API is that

> you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.

No! That's because you stopped using fsync. It's nothing to do with your code being async.

If you just removed the fsync from the sync code you'd quite possibly get a speedup of an order of magnitude too. Or if you put the fsync back in the async version (I don't know io_uring well enough to be sure, but it appears to be possible with "io_uring_prep_fsync") then the throughput would surely slide back. Would the io_uring version still be faster either way? Quite possibly, but because they made an apples-to-oranges comparison, we can't know from this article.
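For anyone who wants to check this themselves, here is a minimal sketch of the like-for-like experiment, using blocking calls in Python rather than io_uring. The helper names (`write_all`, `append_records`) and the `durable` flag are made up for illustration; nothing here comes from the article's code.

```python
import os

def write_all(fd, data):
    """Keep calling write() until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

def append_records(path, records, durable):
    """Append records to path; call fsync after each write only if durable is True.

    Returns the number of fsync calls made, so the two modes can be compared.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    fsync_calls = 0
    try:
        for record in records:
            write_all(fd, record)
            if durable:
                # Blocks until the kernel reports the data as flushed to the device.
                os.fsync(fd)
                fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls
```

Timing `append_records(path, records, durable=True)` against `durable=False` on the same disk would show how much of the speedup comes from dropping fsync alone, before any async API is involved.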

(As other commenters have pointed out, their two-phase commit strategy also fails to provide any guarantee. There's no getting around fsync if you want to be sure that your data is really on the storage medium.)


> > you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.

> No! That's because you stopped using fsync. It's nothing to do with your code being async.

From that section, it sounds like OP was tossing data into the io_uring submission queue and calling it "done" at that point (i.e. not waiting for the io_uring completion queue to have the completion indicated). So yes, fsync is needed, but they weren't even waiting for the kernel to start the write before indicating success.

I think to some extent things have been confused because io_uring has a completion concept, but OP also has a separate completion concept in their dual-WAL design (where they call the second WAL the "completion" WAL).

But I'm not sure if OP really took away the right understanding from their issues with ignoring io_uring completions, as they then create a 5 step procedure that adds one check for an io_uring completion, but still omits another.

> 1. Write intent record (async)

> 2. Perform operation in memory

> 3. Write completion record (async)

> 4. Wait for the completion record to be written to the WAL

> 5. Return success to client

Note the lack of waiting for the io_uring completion of the intent record (and yes, there's still no reference to fsync or an alternative, which is also wrong). There is no ordering guarantee between independent io_urings (OP states they're using separate io_uring instances for each WAL), and even within the same io_uring there is limited ordering around completions. IOSQE_IO_LINK exists, but doesn't allow traversing submission boundaries, so it won't work here because OP submits the work at separate times. They'd need to use IOSQE_IO_DRAIN, which seems like it would effectively serialize their writes, which is why it seems like OP would need to actually wait for completion of the intent write.

Correct, TFA needs to wait for the completion of _all_ writes to the WAL, which is what `fsync()` was doing. Waiting only for the completion of the "completion record" does not ensure that the "intent record" made it to the WAL. In the event of a power failure it is entirely possible that the intent record did not make it but the completion record did, and then on recovery you'll have to panic.
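A rough sketch of that corrected order, written with blocking calls in Python instead of io_uring so the ordering is explicit. The function and parameter names (`commit`, `apply_in_memory`, `wal_fd`) are hypothetical, and a single WAL file is assumed rather than the article's two.

```python
import os

def write_all(fd, data):
    """Keep calling write() until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

def commit(wal_fd, intent, apply_in_memory, completion):
    """Report success only after BOTH WAL records have been flushed.

    Each blocking write returns once the data is in the page cache, so the
    single fsync at the end covers the intent record and the completion record.
    """
    write_all(wal_fd, intent)       # 1. intent record
    apply_in_memory()               # 2. operation in memory
    write_all(wal_fd, completion)   # 3. completion record
    os.fsync(wal_fd)                # 4. wait for all prior writes to be durable
    return True                     # 5. only now tell the client it succeeded
```

With io_uring the equivalent would be to wait for the CQEs of both writes and then for the CQE of an fsync submitted afterwards, since the man page says an fsync queued alongside a write is not guaranteed to apply to it.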
Yes, but I suspect there might be some confusion by the author and others between "io_uring completion of a write" (i.e. io_uring posts the completion queue event that corresponds to a previous submission queue entry) and "fsync completion" (what you've described as "completion of all writes", though note that the fsync API is fd-scoped while the io_uring fsync operation has file range support).

The CQEs on a write indicate something different compared to the CQE of an fsync operation on that same range.

Suggest watching the TigerBeetle video linked in the article. There they discuss bitrot, "fsync gate", how Postgres used fsync wrong for 30 years, etc. It is very interesting even as pure entertainment.
Thanks! Great to hear you enjoyed our talk. Most of it is simply putting the spotlight on UW-Madison’s work on storage faults.

Just to emphasize again that this blog post here is really quite different, since it does not fsync and breaks durability.

Not what we do in TigerBeetle or would recommend or encourage.

See also: https://news.ycombinator.com/item?id=44624065

Hi! I don't have a need for your products directly, but I was very intrigued when I saw TB's demo and talk on ThePrimeagen's YT channel. I have been developing software for a looooong time and it was a breath of fresh air in a sea of startups to see a company champion optimization, speed, and security without going too deep in the weeds and slowing development. These days, that typically comes more as an afterthought or as a response to an incident. Or not at all. I would recommend any developer with an open mind to read this short document[0]. I have been integrating it into my own company's development practices with good results.

[0]https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/TI...

Appreciate your taking the time to write these kind words. Great to hear that TigerStyle has been making an impact on your company’s developer practices!
So OP's real point is that fsync() sucks in the context of modern hardware where thousands of I/O reqs may be in flight at any given time. We need more fine-grained mechanisms to ensure that writes are committed to permanent storage, without introducing undue serialization.
Well, there already is slightly more fine-grained control: in the sync version, you can perhaps call sync write() a few times before calling fsync() once, i.e. basically batch up a few writes. That does have the disadvantage that you can't easily queue new writes while waiting for the previous ones. Perhaps you could use calls to write() in another thread while the first one is waiting for fsync() for the previous batch? You could even have lots of threads doing that in parallel, but probably not the thousands that you mentioned. I don't know the nitty gritty of Linux file IO well enough to know how well that would work.
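The batching idea could look something like the following group-commit sketch in Python. It is only an illustration under assumptions: the class name `GroupCommitLog`, the `max_batch` parameter and the single background flusher thread are all invented here, and it uses one flusher rather than many threads.

```python
import os
import queue
import threading

def write_all(fd, data):
    """Keep calling write() until every byte has been handed to the kernel."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

class GroupCommitLog:
    """Callers block in append() until their record is covered by an fsync."""

    def __init__(self, path, max_batch=64):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self._queue = queue.Queue()
        self._max_batch = max_batch
        self.fsync_calls = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def append(self, record):
        done = threading.Event()
        self._queue.put((record, done))
        done.wait()  # returns only after the batch containing record is fsynced

    def close(self):
        self._queue.put(None)  # sentinel: stop the flusher thread
        self._thread.join()
        os.close(self._fd)

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                return
            batch, stop = [item], False
            # Grab whatever else is already queued, up to max_batch records.
            while len(batch) < self._max_batch:
                try:
                    nxt = self._queue.get_nowait()
                except queue.Empty:
                    break
                if nxt is None:
                    stop = True
                    break
                batch.append(nxt)
            for record, _ in batch:
                write_all(self._fd, record)
            os.fsync(self._fd)  # one fsync for the whole batch
            self.fsync_calls += 1
            for _, done in batch:
                done.set()
            if stop:
                return
```

New records can keep queueing while the flusher is blocked in fsync, and they get picked up as the next batch, so the number of fsync calls can be lower than the number of appends without any caller being told "done" early.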

As I said, I don't know anything about fsync in io_uring. Maybe that has more control?

An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.

> As I said, I don't know anything about fsync in io_uring. Maybe that has now control?

io_uring fsync has byte range support: https://man7.org/linux/man-pages/man2/io_uring_enter.2.html#...

Sorry, that was a typo in my comment (now edited). "Now" was meant to be "more" i.e. "perhaps [io_uring] has more control [than sync APIs]?"

Byte range support is interesting but also present in the Linux sync API:

https://man7.org/linux/man-pages/man2/sync_file_range.2.html

I meant more like, perhaps it's possible to concurrently queue fsync for different writes in a way that isn't possible with the sync API. From your link, it appears not (unless they're isolated at non-overlapping byte ranges, but that's no different from what you can do with sync API + threads):

> Note that, while I/O is initiated in the order in which it appears in the submission queue, completions are unordered. For example, an application which places a write I/O followed by an fsync in the submission queue cannot expect the fsync to apply to the write. The two operations execute in parallel, so the fsync may complete before the write is issued to the storage.

So if two writes are for an overlapping byte range, and you wanted to write + fsync the first one then write + fsync the second then you'd need to queue those four operations in application space, ensuring only one is submitted to io_uring at a time.

> Byte range support is interesting but also present in the Linux sync API: https://man7.org/linux/man-pages/man2/sync_file_range.2.html

Unfortunately, I think sync_file_range() provides much weaker guarantees than byte-range fsync() and even byte-range fdatasync().

As I understand it from historical behaviour and documentation, sync_file_range() doesn't push durability barriers down to the underlying storage devices, nor does it ensure that all metadata needed to access the written pages is itself written and made durable, for example when writing to a hole in a sparse file, to the end-hole created by enlarging a file with ftruncate(), or to fallocate'd pages.

As a result, that means sync_file_range() can only be used as a performance tweak, and not for any durability guarantees that fdatasync() / fsync() are used for.

I'd be delighted to find this has improved since I last looked, but that's what I recall about sync_file_range().

You can insert synchronization OPs (i.e. barriers) in the queue to guarantee in-order execution.
You can also directly link submitted operations into a chain that will be executed in-order but without ordering dependencies on other operations not submitted as part of the chain.
Postgres claims to have some kind of commit batching, but I couldn't figure out how to turn it on.

I wanted to scrub a table by processing each row, but without holding locks, so I wanted to commit every few hundred rows, but with only ACI and not D, since I could just run the process again. I don't think Postgres supports this feature. It also seemed to be calling fsync much more than once per transaction.

> It also seemed to be calling fsync much more than once per transaction.

If it's called many more times than once per transaction, the likely reason is that wal_buffers is sized too small. Whenever generated WAL exceeds wal_buffers, postgres flushes the WAL, so it does not have to reopen the file later. At that point you've already gotten most of the benefits from batching too.

Edit: A second reason is that data pages need to be written out due to cache pressure or such, and that requires the WAL to be flushed first.

Looking through the options listed under "Non-Durable Settings", [1] I guess synchronous_commit = off fits the bill?

[1]: https://www.postgresql.org/docs/current/non-durability.html

Nope, another commenter noted it:

https://www.postgresql.org/docs/current/runtime-config-wal.h...

Don't use it: synchronous_commit = off means durability ~= 0 (i.e. "I hope the write made it to disk")

Maybe I don’t understand what you’re trying to do, but you can directly control how frequently commits occur.

    BEGIN;
    INSERT …; -- batch of size N
    COMMIT AND CHAIN;
    INSERT …;
Chance of Postgres commit mapping 1:1 onto posix fsync or equivalent: slim.
Without parallelism, each commit will be at least one fdatasync (or fsync, O_SYNC/O_DSYNC write, depending on configuration). With parallelism, concurrent transactions might be flushed together, reducing the total number of fsyncs.
Some applications, like Apache Kafka, don't immediately fsync every write. This lets the kernel batch writes and also linearize them, both adding speed. Until synced, the data exists only in the Linux page cache.

To deal with the risk of data loss, multiple such servers are used, with the hope that if one server dies before syncing, another server to which the data was replicated performs an fsync without failure.

I feel like you can try to FAFO with that on a distributed log like Kafka (although also... eww, but also I wonder whether NATS does the same thing or not...)

I would think for something like a database, at most you'd want to have something like the io_uring_prep_fsync others mentioned with flags set to just not update the metadata.

To be clear, in my head I'm envisioning this case to be a WAL type scenario; in my head you can get away with just having a separate thread or threads pulling from WAL and writing to main DB files... but also I've never written a real database so maybe those thoughts are off base.

The Linux RWF_DSYNC flag sets the Force Unit Access (FUA) bit in write requests. This can be used instead of fdatasync(2) in some cases. It only syncs a specific write request instead of the entire disk write cache.
You should prefer RWF_SYNC in case the write involves changes to the file metadata (for example, most append operations will alter the file size).
Agreed, when metadata changes are involved then RWF_SYNC must be used.

RWF_DSYNC is sufficient and faster when data is overwritten without metadata changes to the file.
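For reference, a small sketch of a per-write durable write from Python, assuming Linux and Python 3.7+ where `os.pwritev` accepts the `os.RWF_DSYNC` flag. The helper name `durable_pwrite` and the fallback path are my own illustration, not something from the article.

```python
import errno
import os

def durable_pwrite(fd, data, offset):
    """Write data at offset and return the byte count once the data is synced.

    Uses pwritev2-style RWF_DSYNC where available; otherwise falls back to
    pwrite followed by fdatasync (or fsync where fdatasync is missing).
    Note: like any write call, this may write fewer bytes than requested.
    """
    flag = getattr(os, "RWF_DSYNC", None)
    if flag is not None and hasattr(os, "pwritev"):
        try:
            # Only this write is synced, not everything pending on the fd.
            return os.pwritev(fd, [data], offset, flag)
        except OSError as exc:
            # Old kernels or unusual filesystems may reject the flag.
            if exc.errno not in (errno.ENOTSUP, errno.EINVAL, errno.ENOSYS):
                raise
    written = os.pwrite(fd, data, offset)
    sync = getattr(os, "fdatasync", os.fsync)
    sync(fd)
    return written
```

Swapping `os.RWF_DSYNC` for `os.RWF_SYNC` would request the stricter variant discussed above.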

No that’s incorrect. File size changes caused by append are covered by fdatasync in terms of durability guarantees.
It looks plausible: XFS's xfs_dio_write_end_io() updates the on-disk file size. Do you have a link to documentation that confirms this is true for Linux or POSIX filesystems?

Edit: POSIX 1003.1-2017 defines fdatasync(2) behavior in 3.384 Synchronized I/O Data Integrity Completion, where it says "For write, when the operation has been completed or diagnosed if unsuccessful. The write is complete only when the data specified in the write request is successfully transferred and all file system information required to retrieve the data is successfully transferred".

So I think POSIX does guarantee that a write at the end of the file with O_DSYNC, or followed by fdatasync(2) (and therefore Linux RWF_DSYNC), is sufficient. Thank you for pointing out that RWF_DSYNC is sufficient for appends, vlovich123!

Not really, RWF_DSYNC is equivalent to open(2) with O_DSYNC when writing which is equivalent to write(2) followed by fdatasync(2) and:

  fdatasync() is similar to fsync(), but does not flush modified
  metadata unless that metadata is needed in order to allow a
  subsequent data retrieval to be correctly handled.  For example,
  changes to st_atime or st_mtime (respectively, time of last access
  and time of last modification; see inode(7)) do not require
  flushing because they are not necessary for a subsequent data read
  to be handled correctly.  On the other hand, a change to the file
  size (st_size, as made by say ftruncate(2)), would require a
  metadata flush.
> There's no getting around fsync if you want to be sure that your data is really on the storage medium.

That's not correct; io_uring supports O_DIRECT write requests just fine. Obviously bypassing the cache isn't the same as just flushing it (which is what fsync does), so there are design impacts.

But database engines are absolutely the target of io_uring's feature set and they're expected to be managing this complexity.

O_DIRECT is not a substitute for fsync(). It only guarantees that data gets to the storage device cache, which is not durable in most cases.
My understanding is that the storage device cache is opaque, that is, drives tend to lie, saying the write is done when it is in cache, and depend on having enough internal power capacity to flush on power loss.
Consumer devices sometimes lie (enterprise products less so), but there is a distinction between O_DIRECT and actual fsync at the protocol layer (e.g., in NVMe, fsync maps into a Flush command).
> But database engines are absolutely the target of io_uring's feature set and they're expected to be managing this complexity.

io_uring includes an fsync opcode (with range support). When folks talk about fsync generally here, they're not saying that io_uring is unusable, they're saying that they'd expect fsync to be used, whether it's via the io_uring opcode, the system call, or some other mechanism yet to be created.

That's not what O_DIRECT is for. Did you mean O_SYNC?
If that's true (notwithstanding objections from sibling comments) then that's just another spelling of fsync.

My point was really: you can't magically get the performance benefits of omitting fsync (or functional equivalent) while still getting the durability guarantees it gives.