Hacker News new | ask | show | jobs
by kentonv 1086 days ago
> Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.

This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.

2 comments

> This is not specific to mmap -- regular old write() calls have the same behavior.

This is not true. This depends on how the file was opened. You may request DIRECT | SYNC when opening and the writes are acknowledged when they are actually written. This is obviously a lot slower than writing to cache, but this is the way for "simple" user-space applications to implement their own cache.

In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.

> This is not true. This depends on how the file was opened. You may request DIRECT | SYNC

Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).

> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.

What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.

100% of people writing a database know about filesystem options like DIRECT and SYNC, and that is the subject of this paper.

Also, most of the network-attached storage we people use is in the form of things like EBS, which is very careful to imitate the behavior of a real disk, but with different performance and some different (albeit very rare) failure modes.

100% of people writing databases also know how fsync() and msync() work. I interpreted this thread as being targeted at a wider audience.
It's literally for people writing their own database. Why would you interpret it differently?
It's fun to remember that fsync() on Linux on ext4 at least offers no real guarantee that the data was successfully written to disk. This happens when write errors from background buffered writes are handled internally by the kernel, and they cleanup the error situation (mark dirty pages clean etc). Since the kernel can't know if a later call to fsync() will ever happen, it can't just keep the error around. So, when the call does happen, it will not return any error code. I don't know for sure, but msync() may well have the same behavior.

Here is an LWN article discussing the whole problem as the Postgres team found out about it.

https://lwn.net/Articles/752063/