The article claims that, when they switched to io_uring,
> throughput increased by an order of magnitude almost immediately
But right near the start is the real story: the sync version had
> the classic fsync() call after every write to the log for durability
They are not comparing performance of sync APIs vs io_uring. They're comparing using fsync vs not using fsync! They even go on to say that a problem with async API is that
> you lose the durability guarantee that makes databases useful. ... the data might still be sitting in kernel buffers, not yet written to stable storage.
No! That's because you stopped using fsync. It's nothing to do with your code being async. If you just removed the fsync from the sync code you'd quite possibly get a speedup of an order of magnitude too. Or if you put the fsync back in the async version (I don't know io_uring well enough to understand that but it appears to be possible with "io_uring_prep_fsync") then that would surely slide back.
Would the io_uring version still be faster either way? Quite possibly, but because they made an apples-to-oranges comparison, we can't know from this article.
(As other commenters have pointed out, their two-phase commit strategy also fails to provide any guarantee. There's no getting around fsync if you want to be sure that your data is really on the storage medium.)
> No! That's because you stopped using fsync. It's nothing to do with your code being async.
From that section, it sounds like OP was tossing data into the io_uring submission queue and calling it "done" at that point (i.e. not waiting for the io_uring completion queue to indicate completion). So yes, fsync is needed, but they weren't even waiting for the kernel to start the write before indicating success.
I think to some extent things have been confused because io_uring has a completion concept, but OP also has a separate completion concept in their dual-WAL design (they call the second WAL the "completion" WAL).
But I'm not sure OP really took away the right understanding from their issues with ignoring io_uring completions, as they then create a five-step procedure that adds one check for an io_uring completion, but still omits another.
> 1. Write intent record (async)
> 2. Perform operation in memory
> 3. Write completion record (async)
> 4. Wait for the completion record to be written to the WAL
> 5. Return success to client
Note the lack of waiting for the io_uring completion of the intent record (and yes, there's still no reference to fsync or alternatives, which is also wrong). There is no ordering guarantee between independent io_urings (OP states they're using separate io_uring instances for each WAL), and even within the same io_uring there is limited ordering around completions. IOSQE_IO_LINK exists, but link chains don't cross submission boundaries, so it won't work here because OP submits the work at separate times. They'd need to use IOSQE_IO_DRAIN, which seems like it would effectively serialize their writes. That is why it seems like OP would need to actually wait for completion of the intent write.