| "Are sector writes atomic? And if so, for what size sectors?" For a database, this is not the question you want to be asking. It's a rabbit hole. The right answer (which is not always the answer we want, so we keep asking): no one knows. It's better design to err on the side of assuming disk sector writes are never atomic. This is the security stance: assume an attacker is actively trying to break your protocol, and use cryptographic guarantees to prevent or at least detect the damage. Kernel developers also can't answer these kinds of hardware questions. There's too much hardware out there. Disks may give atomicity guarantees and then you discover edge cases where these degrade, or disks may emulate Advanced Format 4096-byte logical sectors but have smaller 512-byte physical sectors underneath. Disks are too complicated (and firmware too buggy) to trust any kind of guarantee. There are too many layers of abstraction. Instead of trying to figure out the safety of the layer beneath you, assume the worst and bring the design back to using end-to-end guarantees. If you assume and plan for the worst, then you don't need to ask the question, and you can handle the worst of hardware, without surprises when those guarantees are broken. In fact, with file systems in general, the situation is even worse than for disks, because of disk corruption (3% per 36 month period per disk) and misdirected writes, which most file systems unfortunately don't handle. You should never trust any metadata the file system provides about your journal file (e.g. file size) or use it as part of your recovery protocol, because the file system itself stores this kind of metadata on the same disk you're writing to. Again, something as important as the size of your journal file should be protected by your own end-to-end protocol. |
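To make the "end-to-end" idea concrete, here is a minimal sketch of a journal whose records carry their own length and a SHA-256 digest, so recovery never consults the file size reported by the file system. The record layout and the names `HEADER`, `encode_record` and `decode_records` are assumptions made up for illustration, not anything the quoted comment or SQLite specifies.

```python
import hashlib
import struct

# Hypothetical record layout: 4-byte little-endian payload length,
# then a 32-byte SHA-256 digest, then the payload itself.
HEADER = struct.Struct("<I32s")


def _digest(payload: bytes) -> bytes:
    # The digest covers the length as well as the payload.
    return hashlib.sha256(struct.pack("<I", len(payload)) + payload).digest()


def encode_record(payload: bytes) -> bytes:
    """Frame a payload so a reader can verify it without outside metadata."""
    return HEADER.pack(len(payload), _digest(payload)) + payload


def decode_records(journal: bytes) -> list:
    """Return the leading records that verify; stop at the first that does not.

    A torn or corrupted record fails the digest check, so the journal's
    trusted length is whatever the records themselves can prove.
    """
    records = []
    offset = 0
    while offset + HEADER.size <= len(journal):
        length, digest = HEADER.unpack_from(journal, offset)
        start = offset + HEADER.size
        payload = journal[start:start + length]
        if len(payload) != length or _digest(payload) != digest:
            break  # torn write, bit rot, or trailing garbage
        records.append(payload)
        offset = start + length
    return records
```

A reader that recovers this way treats a half-written final record the same as no record at all, which is the behaviour you want if you assume sector writes are never atomic.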
Sometimes it's possible for the filesystem API to guarantee that "sector"-size writes are atomic, even though the underlying storage doesn't do that.
For example, a filesystem that uses copy-on-write when overwriting data, appending each written block to a journal and checksumming or checkpointing the journal, is likely to offer that property.
Sometimes a block device knows that sector writes are atomic too. For example some battery-backed RAID controllers can reasonably guarantee this. Of course it fails if the battery runs out, but the abstraction is intended to assume you never allow that to happen.
It can go the other way as well. When underlying storage provides guaranteed atomic sector writes, the overlaid storage might not. RAID-5 does not provide atomic sector writes even when the underlying storage units do.
It would be useful for filesystems to report when they have that property, because it would allow SQLite to use fewer writes. In effect, if the filesystem already uses a mechanism to ensure atomicity at some performance cost, there's no need for SQLite to use a second mechanism on top at a second performance cost.
RAID-5 is actually worse than non-atomic. Writing to a sector on RAID-5 (without battery backup) puts other sectors in the same stripe at risk too, during the time window until data and parity are all made consistent. I call this the "radius of destruction", and I'm pretty sure SQLite and other software ignore this problem because there's no API for finding out about it.
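One way to see the RAID-5 problem is with plain XOR parity. The sketch below is a toy model, assuming a stripe of two data blocks plus one parity block and a crash after the data block is rewritten but before parity is updated; the helper names are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equally sized blocks together, as RAID-5 parity does."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rebuild_missing(surviving: bytes, parity: bytes) -> bytes:
    """Reconstruct the lost data block of a two-data-block stripe."""
    return xor_blocks(surviving, parity)


# A consistent stripe: two data blocks and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Crash window: the new d0 reaches its disk, the parity update does not.
d0_new = b"CCCC"

# The disk holding d1 now fails. d1 was never written to, yet rebuilding
# it from the new d0 and the stale parity yields the wrong bytes.
rebuilt_d1 = rebuild_missing(d0_new, parity)
damaged = rebuilt_d1 != d1  # True: the untouched block is lost
```

In this model the write to `d0` never touches `d1`, yet `d1` is the block that ends up unrecoverable, which is why the damage is not confined to the sector being written.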
So filesystems could report:
- "Don't know" for when writes go direct to the underlying block device. Or better, report whatever the block device reports for this query, which should be "don't know" in most cases.
- "Yes it's atomic" for when the filesystem layer (or block translation layer, flash translation layer or whatever there is) knows that it provides a reliable atomic-block-write abstraction on top of storage which doesn't provide that.
- "Watch out, we don't even guarantee your sector write won't corrupt related sectors in a geometry group..."
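The three reports above could be modelled roughly as follows. This is only a sketch of the proposal: no such query exists in the form shown, and `SectorWriteReport`, `effective_report` and `can_skip_own_journal` are hypothetical names invented here.

```python
from enum import Enum
from typing import Optional


class SectorWriteReport(Enum):
    """Hypothetical answers a storage layer could give; not a real OS API."""
    UNKNOWN = "don't know"
    ATOMIC = "sector writes are atomic"
    COLLATERAL = "a sector write may corrupt related sectors"


def effective_report(own: Optional[SectorWriteReport],
                     below: SectorWriteReport) -> SectorWriteReport:
    """A layer with its own answer overrides; a pass-through layer
    (own is None) forwards whatever the layer beneath it reports."""
    return below if own is None else own


def can_skip_own_journal(report: SectorWriteReport) -> bool:
    """Only an explicit ATOMIC report justifies dropping the second mechanism."""
    return report is SectorWriteReport.ATOMIC
```

Under this model a RAID-5 layer would answer `COLLATERAL` even when the disks beneath it answer `ATOMIC`, and a database would keep its own journaling for anything other than `ATOMIC`.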