by random_comment 3243 days ago
ZFS has journalling too, in the form of the ZFS Intent Log, where writes are placed rapidly as they occur before becoming part of the main filesystem. The effect is similar to a journal in a journalling filesystem [i.e. to allow recovery and consistency of writes in flight in the event of power loss], but the way it is used is different.
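To make the idea concrete, here is a minimal sketch of an intent log sitting in front of a main store. It is a toy model, not ZFS's actual on-disk format or API: `TinyStore`, `sync_write`, `commit`, `crash` and `recover` are all hypothetical names, and plain Python containers stand in for stable storage.

```python
class TinyStore:
    """Toy model of an intent log in front of a main store (not ZFS's real format)."""

    def __init__(self):
        self.intent_log = []  # stands in for stable storage; survives a crash
        self.main = {}        # stands in for the main on-disk filesystem state
        self.pending = {}     # in-memory changes not yet in main; lost on a crash

    def sync_write(self, key, value):
        # Record the intent on stable storage first, so the write can be
        # acknowledged before the main store is updated.
        self.intent_log.append((key, value))
        self.pending[key] = value

    def commit(self):
        # Fold pending changes into the main store; the logged entries are
        # then no longer needed and can be discarded.
        self.main.update(self.pending)
        self.pending.clear()
        self.intent_log.clear()

    def crash(self):
        # Power loss: in-memory state is gone, stable storage remains.
        self.pending.clear()

    def recover(self):
        # Replay the logged writes in order, then commit them.
        for key, value in self.intent_log:
            self.pending[key] = value
        self.commit()


store = TinyStore()
store.sync_write("a", 1)
store.commit()
store.sync_write("b", 2)  # acknowledged, but not yet committed
store.crash()
store.recover()
print(store.main)
```

The write to `"b"` was never committed before the simulated crash, yet it is present after `recover()` because the log was replayed.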

Unlike most journaled filesystems, ZFS:

- allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being used to ensure integrity (i.e. try running a write performance test on a database like postgresql using ext4 with and without journalling, you'll see a difference).

- allows the filesystem log to be physically mirrored, to protect against the risk of log device failure [which would endanger writes in flight].

Other similarities/differences:

In a journalling FS, you need to take the filesystem offline and check the journal. In ZFS, there is continual passive checking of file data and metadata at time of access, as well as the option of an online 'scrub', which is similar to the fsck of a journalled filesystem but does not require unmounting the filesystem.

While copy-on-write by itself may not necessarily be strictly superior to journalling, ZFS is strictly superior to either.

2 comments

> - allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being used to ensure integrity (i.e. try running a write performance test on a database like postgresql using ext4 with and without journalling, you'll see a difference).

In a proper setup (mount options data=writeback,noatime,relatime, WAL configured reasonably wrt max_wal_size/checkpoint_segments) the overhead due to ext4 journaling shouldn't be a major factor. You'll see some overhead initially when the WAL segments are allocated as you go, but after that they'll be recycled.

For OLTP write-heavy databases I'd say the intent log is more a liability than an advantage; it's easy to screw over performance and/or storage lifetime with it.

This is better than RAID patrol reads only in that it also verifies file system structure periodically. And you do not necessarily have to bring down the filesystem to check, even when data is in flight, as long as its driver supports online scan functionality. More than one FS does so. (XFS, btrfs and probably JFS. Not ext4 though.)

Not that online scanning makes too much sense anyway. The good filesystems verify sanity of the structure they traverse, so might as well put in a full FS read in cron. Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.

> might as well put in a full FS read in cron

ZFS scrub is not the same thing.

If you do a full-filesystem read in a RAID system at the OS level, the redundant blocks won't be read: the RAID system will simply choose one of the copies to read based on which disk is least heavily loaded at the moment. This is why reading from a 2-disk mirror can be up to twice as fast as reading from a single one of the disks comprising the mirror.

During a ZFS scrub, all copies of every block are checked, and because the data is heavily checksummed, ZFS knows which copy is right if one of the 2+ redundant copies doesn't match its checksum.
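A rough sketch of the difference, in Python. This is a toy model, not how ZFS is implemented: `plain_mirror_read` and `scrub_block` are made-up names, and SHA-256 is just a convenient stand-in for whatever checksum the pool is configured to use.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plain_mirror_read(copies, pick=0):
    # A mirror without checksums returns whichever copy it picked
    # and never looks at the other one.
    return copies[pick]


def scrub_block(copies, expected):
    """Check every copy against the stored checksum and rewrite bad copies.

    Returns the number of copies repaired; raises if no copy is valid.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no valid copy left; block is unrecoverable")
    repaired = 0
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # overwrite the bad copy with a verified one
            repaired += 1
    return repaired


data = b"important block"
expected = checksum(data)            # stored separately from the data itself
copies = [b"important blocK", data]  # copy 0 has been silently corrupted

silent = plain_mirror_read(copies, pick=0)  # hands back the bad copy unnoticed
repaired = scrub_block(copies, expected)    # finds and fixes it
print(silent == data, repaired, copies[0] == data)
```

The key point is that the checksum tells the scrub which copy to trust; a plain mirror that finds two differing copies has no way to know which one is right.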

Additionally, ZFS is structured as a Merkle tree (https://en.wikipedia.org/wiki/Merkle_tree) which avoids whole classes of ways traditional filesystems can become deranged at a structural level. ZFS always stores 3+ copies of certain types of filesystem metadata, even on a 1-disk ZFS pool, so that if one gets corrupted, it has 2+ others to choose from. When this same type of corruption happens on a traditional filesystem, well, let's just say that's why `/lost+found` exists.
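For anyone unfamiliar with the Merkle tree idea, here is a two-level sketch: the parent block records the checksum of each child, and a single trusted root checksum covers the parent. The function names and the layout (32-byte SHA-256 digests concatenated) are assumptions for illustration only, not ZFS's block pointer format.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()  # 32 bytes


def build_parent(children):
    # The parent stores the checksum of every child block it points to.
    return b"".join(h(c) for c in children)


def verify(root_sum, parent, children):
    # Walk down from the trusted root: check the parent first, then each child
    # against the checksum the parent holds for it.
    if h(parent) != root_sum:
        return "parent corrupt"
    for i, c in enumerate(children):
        if parent[32 * i:32 * (i + 1)] != h(c):
            return f"child {i} corrupt"
    return "ok"


children = [b"file data 0", b"file data 1"]
parent = build_parent(children)
root_sum = h(parent)

print(verify(root_sum, parent, children))
children[1] = b"file data X"  # simulate corruption of one data block
print(verify(root_sum, parent, children))
```

Because each checksum lives in the parent rather than next to the data it protects, a block that is corrupted, stale or misdirected cannot vouch for itself.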

> Most kinds of damage cannot be repaired on a live filesystem anyway.

See my post above, giving two anecdotes of ZFS actively repairing data on live filesystems. Both systems were in continuous use while these repairs proceeded, and no data were lost in either.

> Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.

You're totally wrong.

The easiest way to demonstrate why is for you to set up a script to randomly write zeros/junk in any amount, at any time, anywhere over one of the block devices being used by ZFS, all day every day.

[Assuming you're using one of the available forms of redundancy, e.g. multiple copies, RAID-Z1/2, or mirroring etc.]

Sit back and watch ZFS giving no fucks at all as it repairs all the damage passively.

You can even introduce such damage in moderate quantities across all of the block devices used by ZFS. Again, you'll see a goddamn incredible amount of self-healing going on and accurate reporting about where it's unable to recover files due to the damage across multiple volumes being too extensive.

It's unlikely that even in this extreme instance of willful massive harm to the disks you'll see the filesystem being damaged, because a) filesystem metadata is checksummed too, b) the metadata blocks are automatically stored twice in different places, and c) you also have the redundancy of multiple devices, e.g. mirroring/RAID-Z.
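If you'd rather not aim a script at real disks, the experiment can be approximated in memory. This is only a simulation under simplifying assumptions (a two-way mirror of 16-byte blocks, checksums held separately and assumed intact, hypothetical function names); it says nothing about real ZFS performance, just about why checksums plus redundancy allow both repair and accurate reporting.

```python
import hashlib
import random


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def make_pool(n_blocks, rng):
    blocks = [bytes(rng.getrandbits(8) for _ in range(16)) for _ in range(n_blocks)]
    sums = [checksum(b) for b in blocks]
    devices = [list(blocks), list(blocks)]  # two-way mirror
    return devices, sums


def damage(device, n_hits, rng):
    # Overwrite randomly chosen blocks with zeros.
    for _ in range(n_hits):
        device[rng.randrange(len(device))] = bytes(16)


def scrub(devices, sums):
    repaired, lost = 0, []
    for i, expected in enumerate(sums):
        good = next((d[i] for d in devices if checksum(d[i]) == expected), None)
        if good is None:
            lost.append(i)  # every copy is bad: report it, don't guess
            continue
        for d in devices:
            if checksum(d[i]) != expected:
                d[i] = good
                repaired += 1
    return repaired, lost


rng = random.Random(0)
devices, sums = make_pool(1000, rng)
for dev in devices:  # damage both sides of the mirror
    damage(dev, 100, rng)
repaired, lost = scrub(devices, sums)
print(f"repaired {repaired} copies, {len(lost)} blocks unrecoverable")
```

Blocks are only lost where the same block happened to be hit on both devices, and the scrub can name exactly which ones those are.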

Try it, prove me wrong.