
Filesystem writes don't always correspond directly to storage sector writes.

Sometimes it's possible for the filesystem API to guarantee that "sector"-size writes are atomic, even though the underlying storage doesn't do that.

For example, a filesystem that uses copy-on-write when overwriting data, appending each written block to a journal and checksumming or checkpointing that journal, is likely to offer that property.
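To make that concrete, here is a minimal sketch of the idea, not how any particular filesystem does it. The record layout, the 512-byte block size, and the names `append_block` and `replay` are all assumptions for illustration. Each write is appended as a checksummed record, and recovery stops at the first record that is torn or fails its checksum, so a reader sees either the old block or the new one, never a mix.

```python
import struct
import zlib

BLOCK_SIZE = 512                      # assumed "sector" size for this sketch
HEADER = struct.Struct("<II")         # hypothetical layout: block number, CRC32
RECORD_SIZE = HEADER.size + BLOCK_SIZE


def append_block(journal: bytearray, block_no: int, data: bytes) -> None:
    """Append one block-sized write to the journal as a checksummed record."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    journal.extend(HEADER.pack(block_no, zlib.crc32(data)) + data)


def replay(journal: bytes) -> dict:
    """Rebuild the block map, stopping at the first torn or corrupt record."""
    blocks = {}
    for off in range(0, len(journal), RECORD_SIZE):
        record = bytes(journal[off:off + RECORD_SIZE])
        if len(record) < RECORD_SIZE:
            break  # torn tail: the record was only partly written
        block_no, crc = HEADER.unpack_from(record)
        data = record[HEADER.size:]
        if zlib.crc32(data) != crc:
            break  # checksum mismatch: treat the write as if it never happened
        blocks[block_no] = data
    return blocks


# Overwrite block 7, then simulate a crash partway through the second write.
journal = bytearray()
append_block(journal, 7, b"A" * BLOCK_SIZE)
append_block(journal, 7, b"B" * BLOCK_SIZE)
after_crash = replay(bytes(journal[:-100]))   # block 7 still holds the "A" data
```

The atomicity comes entirely from the append-plus-checksum discipline, not from anything the underlying storage promises about sector writes.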

Sometimes a block device knows that sector writes are atomic too. For example, some battery-backed RAID controllers can reasonably guarantee this. Of course it fails if the battery runs out, but the abstraction assumes you never allow that to happen.

It can go the other way as well. When underlying storage provides guaranteed atomic sector writes, the overlaid storage might not. RAID-5 does not provide atomic sector writes even when the underlying storage units do.

It would be useful for filesystems and block devices to report when their sector writes are atomic, because it would allow SQLite to use fewer writes. In effect, if the filesystem already uses a mechanism to ensure atomicity at some performance cost, there's no need for SQLite to use a second mechanism on top at a second performance cost.

RAID-5 is actually worse than non-atomic. Writing to a sector on RAID-5 (without battery backup) can corrupt other sectors in the same stripe too, during the time window until data and parity are all made consistent again. I call this the "radius of destruction", and I'm pretty sure SQLite and other software ignore this problem because there's no API for finding out about it.
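A toy illustration of that radius, assuming a 3-disk stripe with plain XOR parity (the block contents and the `xor_blocks` helper are made up for the example): a data block is overwritten, the crash happens before the parity update lands, and then a different disk fails. The block that gets rebuilt wrongly is one that was never written.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk array: two data blocks plus parity.
d0 = b"\x11" * 4
d1 = b"\x22" * 4
parity = xor_blocks(d0, d1)

# Overwrite d0, then "crash" before the matching parity update reaches disk.
d0 = b"\x99" * 4

# Later the disk holding d1 fails, so d1 is rebuilt from d0 and the stale parity.
rebuilt_d1 = xor_blocks(d0, parity)

# d1 was never written, yet the rebuilt copy no longer matches the original.
d1_is_damaged = rebuilt_d1 != d1
```

With consistent parity the same reconstruction returns the original `d1`; only the window between the data write and the parity write is dangerous.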

So filesystems could report:

- "Don't know" for when writes go directly to the underlying block device. Or better, report whatever the block device reports for this query, which should be "don't know" in most cases.

- "Yes it's atomic" for when the filesystem layer (or block translation layer, flash translation layer or whatever there is) knows that it provides a reliable atomic-block-write abstraction on top of storage which doesn't provide that.

- "Watch out, we don't even guarantee your sector write won't corrupt related sectors in a geometry group..."
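As a rough sketch of what such a query could look like to a caller: everything here is hypothetical, including the `WriteAtomicity` names, the `report` pass-through rule, and the strategy labels. No existing filesystem API is being described. The point is only that three answers are enough for a database to pick how defensive it needs to be.

```python
from enum import Enum


class WriteAtomicity(Enum):
    """Hypothetical answers to "is a sector-size write atomic here?"."""
    UNKNOWN = "unknown"                   # no guarantee either way
    ATOMIC = "atomic"                     # this layer guarantees atomic block writes
    COLLATERAL_DAMAGE = "collateral"      # a write may corrupt related sectors


def report(layer_answer, lower_answer=WriteAtomicity.UNKNOWN):
    """A layer with no opinion of its own passes on what the layer below reports."""
    return lower_answer if layer_answer is None else layer_answer


def journal_strategy(atomicity: WriteAtomicity) -> str:
    """Pick an illustrative write strategy for a database sitting on this storage."""
    if atomicity is WriteAtomicity.ATOMIC:
        return "direct-write"             # skip the second atomicity mechanism
    if atomicity is WriteAtomicity.COLLATERAL_DAMAGE:
        return "journal-whole-group"      # protect the whole geometry group
    return "journal"                      # assume the worst for a single sector


# A pass-through filesystem on a battery-backed controller, and RAID-5 on top
# of disks that are individually atomic.
passthrough = journal_strategy(report(None, WriteAtomicity.ATOMIC))
raid5 = journal_strategy(report(WriteAtomicity.COLLATERAL_DAMAGE, WriteAtomicity.ATOMIC))
```

Note that the RAID-5 case overrides the answer from below, which matches the observation that an overlaid layer can lose a guarantee the underlying storage had.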