Hacker News
by _vvhw 1979 days ago
Yes, and it's not only about performance, but also safety because O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk): https://www.usenix.org/system/files/atc20-rebello.pdf

From a safety perspective, O_DIRECT is now table stakes. There's simply no control over the granularity of read/write EIO errors when your syscalls only touch memory and you have no visibility into background flush errors.
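
To make the fsync-failure point concrete, here is a minimal Python sketch of the "do not retry, do not trust the cache" policy. The names `DurabilityLost` and `fsync_or_abort` are hypothetical and only for illustration; the assumption is that the caller responds to the exception by stopping and running recovery from the journal rather than calling fsync again.

```python
import os


class DurabilityLost(Exception):
    """Hypothetical error type: fsync failed, so cached pages are suspect."""


def fsync_or_abort(fd: int) -> None:
    """Call fsync once and treat any failure as fatal for this file.

    Assumption: after a failed fsync the page cache may no longer match
    the disk, so a later fsync that succeeds proves nothing about the
    writes that came before the failure.
    """
    try:
        os.fsync(fd)
    except OSError as err:
        # Do not retry here; hand control to the recovery path instead.
        raise DurabilityLost(f"fsync failed: {err}") from err
```

A caller would wrap its commit path in this and, on `DurabilityLost`, re-read the journal from disk instead of continuing with what is in memory.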

Around four years ago I was working on a transactional data store and ran into the problem that virtually no one tells you how durable I/O is supposed to work. Very few articles on the internet went beyond the basics (e.g. create file => fsync directory), and there was perhaps one article explaining what needs to be considered when using sync_file_range. The docs and POSIX were useless. I noticed that there seemed to be inherent problems with I/O error handling when using the page cache, i.e. whenever something other than the app itself caused write I/O, you no longer really knew whether all the data got there.
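
For anyone who has not seen the "create file => fsync directory" step written out, a rough Python sketch on Linux might look like this. The function name `create_durably` is made up for illustration, and it deliberately ignores the harder error-handling questions discussed in this thread.

```python
import os


def create_durably(path: str, payload: bytes) -> None:
    """Create a new file, fsync it, then fsync its parent directory."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the file's data and metadata
    finally:
        os.close(fd)

    # The new directory entry lives in the parent directory, so the
    # directory itself is fsynced to make the file's name durable.
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the second fsync, the file contents can be on disk while the name pointing at them is lost after a crash.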

Some two years later fsyncgate happened, and since then I/O error handling on Linux has finally gotten at least some attention. People seem to have woken up to the fact that this is a genuinely hard thing to do.

What was the data store you were working on? Is it open source?

My experience was the same as yours.

What helped me was discovering all the fantastic storage and file system papers coming out of the University of Wisconsin Madison, supervised by Remzi and Andrea Arpaci-Dusseau.

Their teams have studied and documented almost all aspects of what is required to write reliable storage systems, even diving into interactions between local storage failures and global consensus protocols, such as how a single disk block failure can destroy Raft and ZooKeeper. Most safety testing of these systems tends to focus on the network fault model. I think in a few years' time we'll all look back and see how today we had almost no concept of a storage fault model. It's kind of exciting to think that there's going to be a new breed of replicated databases that are far more reliable than today's systems. On the other hand, perhaps the future is already here, just not very evenly distributed.

http://pages.cs.wisc.edu/~remzi/

Their OSTEP book (Operating Systems in Three Easy Pieces) is also a great fun read: http://pages.cs.wisc.edu/~remzi/OSTEP/

> From a safety perspective, O_DIRECT is now table stakes

Except for the awkward problem where O_DIRECT writes don't send a write-barrier to the drives, so the written data can still disappear.

That's a common misunderstanding of the purpose of O_DIRECT.

For a write barrier, you would still use fsync() or O_DSYNC, along with O_DIRECT.
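
As a rough illustration of that combination, here is a Python sketch for Linux. The name `direct_write_with_barrier` and the 4096-byte `BLOCK` are assumptions for the example; real code should query the device's logical block size. The `direct=False` switch exists only so the sketch can run on filesystems that reject O_DIRECT.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment unit; not queried from the device here


def direct_write_with_barrier(path: str, payload: bytes, direct: bool = True) -> int:
    """Write payload (zero-padded to a BLOCK multiple), then fsync.

    Returns the number of bytes written, padding included.
    """
    blocks = max(1, (len(payload) + BLOCK - 1) // BLOCK)
    buf = mmap.mmap(-1, blocks * BLOCK)  # anonymous mmap is page-aligned
    buf[:len(payload)] = payload         # the rest stays zero-filled

    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        flags |= os.O_DIRECT  # bypass the page cache (Linux-specific)
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, buf)
        if written != len(buf):
            raise OSError("short write; a real implementation would resume")
        os.fsync(fd)  # O_DIRECT alone is not a barrier; this is
        return written
    finally:
        os.close(fd)
        buf.close()
```

The O_DIRECT flag only changes where the data goes on the way to the drive; the fsync call is what asks for it to be made durable.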

The man page for open(2) is clear on this: https://man7.org/linux/man-pages/man2/open.2.html

I guess I know, as this is what I found when I googled just now :-) https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct...

I was trying to address this aspect of the parent comment:

> O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk)

O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

(By the way, O_SYNC/O_DSYNC are equivalent to calling fsync/fdatasync after each write, therefore subject to some of the same issues.)

But even in normal situations with fsync working fine, it is not clear if you can rely on fsync to do a drive write-cache flush when there isn't any metadata or page cache data for the file because you've only been using O_DIRECT.

Neither the open(2) nor the fsync(2) man page addresses this durability issue. You can't use O_DSYNC or O_SYNC with O_DIRECT and still get good throughput, because your database does not want the overhead of a write-cache flush on every write. You only want it for barriers. And you can't rely on fdatasync because there's no data to flush in the page cache, no block I/O to do, so fdatasync could meet expectations by doing nothing.

My solution in the past has been to toggle the LSB in st_mtime before fsync, just to force a drive write-cache flush when I'm not sure that anything else will force one. It's not pretty.
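
In Python terms, that workaround could be sketched roughly as below. The function name is hypothetical, and whether the resulting metadata change actually triggers a drive write-cache flush depends on the kernel and filesystem; the code only shows the mechanics of dirtying the inode before the fsync.

```python
import os

NS_PER_SEC = 1_000_000_000


def fsync_with_forced_metadata_change(fd: int) -> None:
    """Flip the low bit of the mtime seconds, then fsync.

    The intent is to guarantee the inode is dirty so that fsync has
    something to write; the flush itself is not guaranteed by this code.
    """
    st = os.fstat(fd)
    sec, nsec = divmod(st.st_mtime_ns, NS_PER_SEC)
    toggled_ns = (sec ^ 1) * NS_PER_SEC + nsec  # toggle LSB of seconds
    os.utime(fd, ns=(st.st_atime_ns, toggled_ns))  # atime left unchanged
    os.fsync(fd)
```

The obvious cost is that the file's mtime is off by one second half the time, which is part of why it's not pretty.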

> O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

I was specifically not referring to an fsync failure in your sense (where the disk or fs does not respect fsync at all, so that fsync is a no-op, or where the fs has a bug with O_DIRECT not flushing if it sees nothing dirty in the page cache; by the way, I think this is no longer an issue, and if it still is, it's a kernel bug you can report)

...but to handling an fsync error in the context of the paper from WISC that I linked to in that parent comment, where the kernel's page cache has gone out of sync with the disk after an fsync EIO error ("Fsyncgate"):

"when the page cache can no longer be trusted by the database to be coherent with the disk: https://www.usenix.org/system/files/atc20-rebello.pdf"

The details are all in the paper. Sure, some disks may not respect fsync, but O_DIRECT is still the only way to safely read and recover from the journal when the kernel's page cache is out of sync with the disk (again, details in the paper). It's another fantastic paper out of WISC.
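
For reference, reading journal bytes from disk rather than from the page cache might look something like this in Python on Linux. `read_from_disk` and the 4096-byte `BLOCK` are assumptions for the sketch, not anything taken from the paper, and the `direct=False` switch is only there so it runs on filesystems that reject O_DIRECT.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment unit; not queried from the device here


def read_from_disk(path: str, length: int, direct: bool = True) -> bytes:
    """Return up to `length` bytes from the start of the file.

    With direct=True the file is opened with O_DIRECT, so the read is
    served by the storage device instead of possibly stale cached pages.
    """
    blocks = max(1, (length + BLOCK - 1) // BLOCK)
    buf = mmap.mmap(-1, blocks * BLOCK)  # page-aligned, as O_DIRECT needs
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    try:
        got = os.readv(fd, [buf])  # reads straight into the aligned buffer
        return bytes(buf[:min(got, length)])
    finally:
        os.close(fd)
        buf.close()
```

A recovery routine would read the journal this way, verify its checksums, and only then rebuild in-memory state from it.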