by random_comment 3243 days ago
This entire article can be summarised as 'guy who has never used ZFS and has no idea whatsoever about how it works writes a critique that exposes their ignorance publicly'.

Here's a quote:

- “ZFS has CRCs for data integrity

A certain category of people are terrified of the techno-bogeyman named “bit rot.” These people think that a movie file not playing back or a picture getting mangled is caused by data on hard drives “rotting” over time without any warning. The magical remedy they use to combat this today is the holy CRC, or “cyclic redundancy check.” It’s a certain family of hash algorithms that produce a magic number that will always be the same if the data used to generate it is the same every time.

This is, by far, the number one pain in the ass statement out of the classic ZFS fanboy’s mouth..."

Meanwhile in reality...

ZFS does not use CRCs for checksums.

It's very hard to take someone's view seriously when they are making mistakes at this level.

ZFS allows a range of checksum algorithms, including SHA256, and you can even specify per dataset the strength of checksum you want.
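
To make the checksum point concrete, here's a rough Python sketch of verifying a block against a stored cryptographic hash. It is not ZFS code: the function names, the dataset names and the `dataset_algorithm` table are all made up for illustration, and the only real API used is Python's `hashlib`.

```python
import hashlib

def block_checksum(block: bytes, algorithm: str = "sha256") -> str:
    """Hex digest of one block; `algorithm` is any name hashlib accepts."""
    return hashlib.new(algorithm, block).hexdigest()

def block_is_intact(block: bytes, stored: str, algorithm: str = "sha256") -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return block_checksum(block, algorithm) == stored

# Hypothetical per-dataset setting: a stronger hash for the data you care about most.
dataset_algorithm = {"tank/photos": "sha512", "tank/scratch": "sha256"}

original = bytes(4096)                       # a 4 KiB block of zeros
algorithm = dataset_algorithm["tank/photos"]
stored = block_checksum(original, algorithm)

damaged = bytearray(original)
damaged[100] ^= 0x01                         # flip a single bit

print(block_is_intact(original, stored, algorithm))        # intact block passes
print(block_is_intact(bytes(damaged), stored, algorithm))  # flipped bit is caught
```

The design point is that the checksum is stored separately from the block and recomputed on every read, so a mismatch is noticed at access time instead of years later.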

- "Hard drives already do it better"

No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and money making it.

It makes a bit of a difference when your disk says 'whoops, sorry, CRC fail, that block's gone?' and it was holding your whole filesystem together. Or when a power surge or bad component fries the whole drive at once.

ZFS allows optional duplication of metadata or data blocks automatically; as well as multiple levels of RAID-equivalency for automatic, transparent rebuilding of data/metadata in the presence of multiple unreliable or failed devices. Hard drives... don't do that.

Even ZFS running on a single disk can automatically keep 2 (or more) copies on disk of whatever datasets you think are especially important - just check the flag. Regular hard drives don't offer that.

- "What about the very unlikely scenario where several bits flip in a specific way that thwarts the hard drive’s ECC? This is the only scenario where the hard drive would lose data silently, therefore it’s also the only bit rot scenario that ZFS CRCs can help with."

Well, that and entire disk failures.

And power failures leading to inconsistency on the drive.

And cable faults leading to the wrong data being sent to the drive to be written.

And drive firmware bugs.

And faulty cache memory or faulty controllers on the hard drive.

And poorly connected drives with intermittent glitches / timeouts in communication.

You get the idea.

I could also point out that ZFS allows you to back up quickly and precisely (via snapshots, and incremental snapshot diffs).

It allows you to detect errors as they appear (via scrubs) rather than find out years later when your photos are filled with vomit coloured blocks.

It also tells you every time it opens a file if it has found an error, and corrected it in the background for you - thank god! This 'passive warning' feature alone lets you quickly realise you have a bad disk or cable so you can do something about it. Consider the same situation with a hard drive over a period of years...

ZFS is a copy-on-write filesystem, so if something naughty happens like a power-cut during an update to a file, your original data is still there. Unlike a hard disk (or RAID).
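
A toy Python model of the copy-on-write idea, assuming a single file and a simulated power cut. The `CowStore` class and its `crash_before_commit` flag are invented for this sketch and have nothing to do with how ZFS is actually implemented. New data always goes to a fresh block, and the file only "changes" when the root pointer is switched.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes
        self.next_id = 0
        self.root = None    # id of the block the file currently points at

    def _allocate(self, data: bytes) -> int:
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self._allocate(data)   # step 1: write the new copy elsewhere
        if crash_before_commit:
            return                      # simulated power cut: root is untouched
        self.root = new_id              # step 2: switch the pointer

    def read(self) -> bytes:
        return self.blocks[self.root]


store = CowStore()
store.write(b"version 1")
store.write(b"version 2", crash_before_commit=True)  # update interrupted
print(store.read())                                  # still the old, complete data
```

An in-place update interrupted at the same moment could leave a block that is half old and half new; here the worst case is an orphaned block that nothing points at.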

It's trivial to set up automatic snapshots, which as well as allowing known-point-in-time recovery, are an exceptionally effective way to prevent viruses, user errors etc from wrecking your data. You can always wind back the clock.

Where is the author losing his data (that he knows of, and in his very limited experience...)? "All of my data loss tends to come from poorly typed ‘rm’ commands." ... so, exactly the kind of situation that ZFS snapshots allow instant, certain, trouble-free recovery from in the space of seconds? [either by rolling back the filesystem, or by conveniently 'dipping into' past snapshots as though they were present-day directories as needed]
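
The two recovery paths can be sketched in a few lines of Python. This is a toy model with invented names (`SnapshotFs`, `snapshot`, `rollback`), not the ZFS interface; it only shows why a snapshot is cheap when file contents are immutable, because it copies references and not data.

```python
class SnapshotFs:
    """Toy filesystem: a snapshot is a frozen copy of the name -> contents map."""

    def __init__(self):
        self.live = {}        # filename -> immutable bytes
        self.snapshots = {}   # snapshot name -> frozen view of `live`

    def snapshot(self, name: str) -> None:
        # Copies references only; the bytes objects themselves are shared.
        self.snapshots[name] = dict(self.live)

    def rollback(self, name: str) -> None:
        self.live = dict(self.snapshots[name])


fs = SnapshotFs()
fs.live["photo.jpg"] = b"pixels"
fs.live["notes.txt"] = b"todo"
fs.snapshot("hourly")

del fs.live["photo.jpg"]                          # the poorly typed 'rm'

recovered = fs.snapshots["hourly"]["photo.jpg"]   # dip into the past snapshot
fs.rollback("hourly")                             # or wind the whole thing back
print(sorted(fs.live))
```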

Anyway I do hope Mr/Ms nctritech learns to read the beginner's guide for technologies they critique in future, maybe even try them once or twice, before they write their critique.

What next?

"Why even use C? Everything you can do in C, you can do in PHP anyway!"

2 comments

> No, they don't, or Oracle wouldn't have spent money making it.

Tiny nitpick: although Oracle now owns and develops ZFS, Sun Microsystems was the company that initially designed and implemented it. They worked on it for 5 years after they released it, before Oracle acquired them.

Whoops, thanks for the catch. Have updated and also added OpenZFS to that sentence.

Copy on write is a good thing, as is log structure which is even more resilient. However, it is not strictly superior to journaling in terms of data safety. The copied data will get garbage collected or overwritten after some time and regardless might be tricky to recover.

ZFS has journalling too, in the form of the ZFS Intent Log, where writes are placed rapidly as they occur before becoming part of the main filesystem. The effect is similar to a journal in a journalling filesystem [i.e. to allow recovery and consistency of writes in flight in the event of power loss], but the way it is used is different.

Unlike most journaled filesystems, ZFS:

- allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being used to ensure integrity (i.e. try running a write performance test on a database like postgresql using ext4 with and without journalling, you'll see a difference).

- allows the filesystem log to be mirrored physically, to protect against the risk of log device failure [which would endanger writes in flight].
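
The two points above can be modelled roughly in Python: synchronous writes land on every log device first and are replayed into the main store after a crash, from whichever log copy survived. The class `IntentLogSketch` and its methods are invented for illustration and simplify heavily; this is not how the real intent log is laid out.

```python
class IntentLogSketch:
    """Toy model: a mirrored log in front of a slower main store."""

    def __init__(self, mirrors: int = 2):
        self.log_devices = [[] for _ in range(mirrors)]  # None marks a failed device
        self.main = {}                                   # the main filesystem

    def sync_write(self, key: str, value: bytes) -> None:
        # The write counts as durable once every healthy log device has it.
        for device in self.log_devices:
            if device is not None:
                device.append((key, value))

    def fail_log_device(self, index: int) -> None:
        self.log_devices[index] = None

    def replay(self) -> None:
        """After a crash, apply logged writes from any surviving log copy."""
        survivors = [d for d in self.log_devices if d is not None]
        if not survivors:
            raise RuntimeError("all log devices lost; writes in flight are gone")
        for key, value in survivors[0]:
            self.main[key] = value
        for device in survivors:
            device.clear()


store = IntentLogSketch(mirrors=2)
store.sync_write("row:1", b"committed")
store.fail_log_device(0)     # one log device dies before the crash
store.replay()               # the mirror still holds the write
print(store.main)
```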

Other similarities/differences:

In a journalling FS, you need to take the filesystem offline and check the journal. In ZFS, there is continual passive checking of file data and metadata at time of access, as well as the option for an online 'scrub' that is similar to the fsck of a journalled filesystem without requiring dismounting of the filesystem.

While copy-on-write by itself may not necessarily be strictly superior to journalling, ZFS is strictly superior to either.

> - allows a separate high-performance device to be used for the log. This is important because the cost of journalling can be high when lots of fsyncs are being used to ensure integrity (i.e. try running a write performance test on a database like postgresql using ext4 with and without journalling, you'll see a difference).

In a proper setup (mount options data=writeback,noatime,relatime; WAL configured reasonably wrt max_wal_size/checkpoint_segments) the overhead due to ext4 journaling shouldn't be a major factor. You'll see some overhead initially when the WAL segments are allocated as you go, but after that they'll be recycled.

For write-heavy OLTP databases I'd say the intent log is more a liability than an advantage: it's easy to screw over performance and/or storage lifetime with it.

This is better than RAID patrol reads only in that it also verifies file system structure periodically. And you do not necessarily have to bring down the filesystem to check even when data is in flight, as long as its driver supports online scan functionality. More than one FS does so. (XFS, btrfs and probably JFS. Not ext4 though.)

Not that online scanning makes too much sense anyway. The good filesystems verify sanity of the structure they traverse, so might as well put in a full FS read in cron. Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.

> might as well put in a full FS read in a cron

ZFS scrub is not the same thing.

If you do a full-filesystem read in a RAID system at the OS level, the redundant blocks won't be read: the RAID system will simply choose one of the copies to read based on which disk(s) is least heavily loaded at the moment. This is why reading on a 2-disk mirror is twice as fast as reading from a single one of the disks comprising the mirror.

During a ZFS scrub, all copies of every block are checked, and because the data is heavily checksummed, ZFS knows which copy is right if one of the 2+ redundant copies doesn't match its checksum.
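
Here is a minimal Python sketch of that scrub logic on a 2-way mirror, assuming each disk is just a list of blocks and the expected checksums are kept separately. The `scrub` function and its return shape are my own invention for illustration, not ZFS internals.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def scrub(mirrors, checksums):
    """Check every copy of every block; repair bad copies from a good one.

    mirrors:   list of disks, each a list of blocks (bytes)
    checksums: expected digest for each block index
    Returns (repaired, unrecoverable): a list of (disk, block) pairs that were
    rewritten, and a list of block indexes with no good copy left.
    """
    repaired, unrecoverable = [], []
    for index, expected in enumerate(checksums):
        good = next((disk[index] for disk in mirrors
                     if digest(disk[index]) == expected), None)
        if good is None:
            unrecoverable.append(index)
            continue
        for disk_number, disk in enumerate(mirrors):
            if digest(disk[index]) != expected:
                disk[index] = good               # self-heal from the good copy
                repaired.append((disk_number, index))
    return repaired, unrecoverable


blocks = [b"alpha", b"beta", b"gamma"]
checksums = [digest(b) for b in blocks]
disk_a, disk_b = list(blocks), list(blocks)
disk_b[1] = b"junk"                              # silent corruption on one side

repaired, unrecoverable = scrub([disk_a, disk_b], checksums)
print(repaired, unrecoverable)
```

A plain full-filesystem read would have had roughly a 50% chance of being served from `disk_a` and never noticing that `disk_b` was bad.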

Additionally, ZFS is structured as a Merkle tree (https://en.wikipedia.org/wiki/Merkle_tree) which avoids whole classes of ways traditional filesystems can become deranged at a structural level. ZFS always stores 3+ copies of certain types of filesystem metadata, even on a 1-disk ZFS pool, so that if one gets corrupted, it has 2+ others to choose from. When this same type of corruption happens on a traditional filesystem, well, let's just say that's why `/lost+found` exists.
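
For anyone who hasn't met Merkle trees, a small Python sketch of the general technique follows. It builds a plain binary hash tree over a list of blocks; ZFS's real on-disk tree is shaped differently (the checksums live in the block pointers of the parent blocks), so treat `merkle_root` as a generic illustration with made-up names.

```python
import hashlib

def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Root hash of a binary Merkle tree over a non-empty list of blocks."""
    level = [digest(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2:                 # odd count: pair the last hash with itself
            level.append(level[-1])
        level = [digest(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


good_root = merkle_root([b"a", b"b", b"c"])
bad_root = merkle_root([b"a", b"X", b"c"])   # one block silently changed
print(good_root != bad_root)
```

Because every parent hashes its children, damage anywhere below changes the root, so a block can't go bad without something above it noticing.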

> Most kinds of damage cannot be repaired on a live filesystem anyway.

See my post above, giving two anecdotes of ZFS actively repairing data on live filesystems. Both systems were in continuous use while these repairs proceeded, and no data were lost in either.

> Most kinds of damage cannot be repaired on a live filesystem anyway. Even in ZFS.

You're totally wrong.

The easiest way to demonstrate why is for you to set up a script to randomly write zeros/junk in any amount, at any time, anywhere over one of the block devices being used by ZFS, all day every day.

[Assuming you're using one of the available forms of redundancy i.e. multiple copies, RAIDZ1/2, or mirroring etc.]

Sit back and watch ZFS giving no fucks at all as it repairs all the damage passively.

You can even introduce such damage in moderate quantities across all of the block devices used by ZFS. Again, you'll see a goddamn incredible amount of self-healing going on and accurate reporting about where it's unable to recover files due to the damage across multiple volumes being too extensive.

It's unlikely that even in this extreme instance of willful massive harm to the disks you'll see the filesystem being damaged, because a) filesystem metadata is checksummed too, b) the metadata blocks are automatically stored twice in different places, and c) you also have the redundancy of multiple devices e.g. mirroring/RAID-Z.

Try it, prove me wrong.