by quietbritishjim 337 days ago
Well, there already is slightly more fine-grained control: in the sync version, you can call write() a few times before calling fsync() once, i.e. basically batch up a few writes. That does have the disadvantage that you can't easily queue new writes while waiting for the previous ones. Perhaps you could call write() in another thread while the first one is waiting on fsync() for the previous batch? You could even have lots of threads doing that in parallel, but probably not the thousands that you mentioned. I don't know the nitty-gritty of Linux file IO well enough to know how well that would work.
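
For what it's worth, the "several write() calls, one fsync()" batching could be sketched roughly like this in Python, using the thin `os` wrappers over the syscalls. The helper names (`write_batch`, `demo`) and the temp-file path are made up for illustration, and this says nothing about how it performs relative to io_uring:

```python
import os
import tempfile


def write_batch(fd, chunks):
    """Issue one write() per chunk, then a single fsync() for the whole batch."""
    total = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:                      # write() may write fewer bytes than asked
            n = os.write(fd, view)
            view = view[n:]
            total += n
    os.fsync(fd)                         # one flush covers every write above
    return total


def demo():
    # Hypothetical scratch file, only here so the sketch is self-contained.
    path = os.path.join(tempfile.mkdtemp(), "batch.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        written = write_batch(fd, [b"record-1\n", b"record-2\n", b"record-3\n"])
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return written, f.read()
```

The caller blocks in `os.fsync()` until the batch is flushed, which is exactly the "can't easily queue new writes while waiting" limitation mentioned above.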

As I said, I don't know anything about fsync in io_uring. Maybe that has more control?

An article that did a fair comparison, by someone who actually knows what they're talking about, would be pretty interesting.


> As I said, I don't know anything about fsync in io_uring. Maybe that has now control?

io_uring fsync has byte range support: https://man7.org/linux/man-pages/man2/io_uring_enter.2.html#...

Sorry, that was a typo in my comment (now edited). "Now" was meant to be "more" i.e. "perhaps [io_uring] has more control [than sync APIs]?"

Byte range support is interesting but also present in the Linux sync API:

https://man7.org/linux/man-pages/man2/sync_file_range.2.html

I meant more like, perhaps it's possible to concurrently queue fsync for different writes in a way that isn't possible with the sync API. From your link, it appears not (unless they're isolated at non-overlapping byte ranges, but that's no different from what you can do with sync API + threads):

> Note that, while I/O is initiated in the order in which it appears in the submission queue, completions are unordered. For example, an application which places a write I/O followed by an fsync in the submission queue cannot expect the fsync to apply to the write. The two operations execute in parallel, so the fsync may complete before the write is issued to the storage.

So if two writes are for an overlapping byte range, and you wanted to write + fsync the first one then write + fsync the second then you'd need to queue those four operations in application space, ensuring only one is submitted to io_uring at a time.
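
A minimal sketch of that kind of application-space queue, assuming plain blocking `pwrite()` + `fsync()` on a single worker thread as a stand-in for submitting to io_uring (Python's standard library has no io_uring binding). `OrderedDurableWriter` and `demo` are hypothetical names, and error handling is left out:

```python
import os
import queue
import tempfile
import threading


class OrderedDurableWriter:
    """Runs (pwrite, fsync) pairs strictly one at a time, in submission order."""

    def __init__(self, fd):
        self._fd = fd
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, offset, data):
        """Queue a write; the returned Event is set once it has been fsync'd."""
        done = threading.Event()
        self._jobs.put((offset, data, done))
        return done

    def close(self):
        self._jobs.put(None)             # sentinel: stop after the queued jobs
        self._worker.join()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:
                return
            offset, data, done = job
            view = memoryview(data)
            while view:                  # handle short writes
                n = os.pwrite(self._fd, view, offset)
                view = view[n:]
                offset += n
            os.fsync(self._fd)           # flush before the next job may start
            done.set()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "ordered.bin")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    writer = OrderedDurableWriter(fd)
    try:
        first = writer.submit(0, b"AAAAAAAA")
        second = writer.submit(4, b"BBBB")   # overlaps bytes 4..7 of the first
        first.wait()
        second.wait()
    finally:
        writer.close()
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()
```

Because only one write + fsync pair is in flight at a time, the overlapping second write always lands on top of the first, at the cost of serializing everything that goes through the queue.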

> Byte range support is interesting but also present in the Linux sync API: https://man7.org/linux/man-pages/man2/sync_file_range.2.html

Unfortunately, I think sync_file_range() provides much weaker guarantees than byte-range fsync() and even byte-range fdatasync().

As I understand it from historical behaviour and documentation, sync_file_range() doesn't push durability barriers down to the underlying storage devices, nor does it ensure that all metadata needed to access the written pages is itself written and made durable, for example when writing to a hole in a sparse file, to the end-hole created by enlarging a file with ftruncate(), or to fallocate'd pages.

As a result, sync_file_range() can only be used as a performance tweak, and not for any of the durability guarantees that fdatasync() / fsync() are used for.

I'd be delighted to find this has improved since I last looked, but that's what I recall about sync_file_range().

You can insert synchronization OPs (i.e. barriers) in the queue to guarantee in-order execution.

You can also directly link submitted operations into a chain that will be executed in-order but without ordering dependencies on other operations not submitted as part of the chain.

Postgres claims to have some kind of commit batching, but I couldn't figure out how to turn it on.

I wanted to scrub a table by processing each row, but without holding locks, so I wanted to commit every few hundred rows, but with only ACI and not D, since I could just run the process again. I don't think Postgres supports this feature. It also seemed to be calling fsync much more than once per transaction.

> It also seemed to be calling fsync much more than once per transaction.

If it's called many more times than once per transaction, the likely reason is that wal_buffers is sized small. Whenever generated WAL exceeds wal_buffers, Postgres flushes the WAL, so it does not have to reopen the file later. At that point you have already gotten most of the benefits of batching too.

Edit: A second reason is that data pages need to be written out due to cache pressure or such, and that requires the WAL to be flushed first.

Looking through the options listed under "Non-Durable Settings", [1] I guess synchronous_commit = off fits the bill?

[1]: https://www.postgresql.org/docs/current/non-durability.html

Nope, another commenter noted it:

https://www.postgresql.org/docs/current/runtime-config-wal.h...

Don't use it: synchronous_commit = off means durability ~= 0 (i.e. "I hope the write made it to disk").

Maybe I don’t understand what you’re trying to do, but you can directly control how frequently commits occur.

    BEGIN;
    INSERT … -- batch of size N
    COMMIT AND CHAIN;
    INSERT …

Chance of Postgres commit mapping 1:1 onto POSIX fsync or equivalent: slim.

Without parallelism, each commit will be at least one fdatasync (or fsync, O_SYNC/O_DSYNC write, depending on configuration). With parallelism, concurrent transactions might be flushed together, reducing the total number of fsyncs.
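
To illustrate the general idea of concurrent commits sharing a flush, here is a toy group-commit log in Python. This is not Postgres's implementation, just a sketch under the assumption of an append-only file and small records; `GroupCommitLog`, `fsync_calls`, and `demo` are names invented for the example:

```python
import os
import tempfile
import threading


class GroupCommitLog:
    """Toy append-only log where one fsync() can cover several commits."""

    def __init__(self, fd):
        self._fd = fd                          # expected to be opened with O_APPEND
        self._lock = threading.Lock()          # guards the two byte counters
        self._flush_lock = threading.Lock()    # at most one fsync() in flight
        self._appended = 0                     # bytes handed to write()
        self._flushed = 0                      # bytes known to be fsync'd
        self.fsync_calls = 0

    def commit(self, record):
        with self._lock:
            os.write(self._fd, record)         # short writes ignored in this sketch
            self._appended += len(record)
            my_end = self._appended
        with self._flush_lock:
            with self._lock:
                if self._flushed >= my_end:
                    return                     # an earlier fsync already covered us
                target = self._appended        # everything written so far
            os.fsync(self._fd)
            self.fsync_calls += 1
            with self._lock:
                self._flushed = max(self._flushed, target)


def demo(n_threads=8):
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    log = GroupCommitLog(fd)
    threads = [
        threading.Thread(target=log.commit, args=(b"txn-%d\n" % i,))
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.close(fd)
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    return log.fsync_calls, sorted(lines)
```

With a single caller every commit pays for its own fsync; with several callers, any commit that was written before another thread's fsync started can return without issuing one of its own, so `fsync_calls` ends up somewhere between 1 and the number of commits depending on timing.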