by kiney 285 days ago
btrfs has many technical advantages over zfs
2 comments

Yes, like destroying itself and losing all data.
ZFS is perfectly capable of this too.

source: worked as a support engineer for a block storage company, witnessed hundreds of customers blowing one or both of their feet off with ZFS.

To what extent are these customers blaming the hammer for hitting their thumb?

(Legitimate question: I manage several PB with ZFS and would like to know where I should be more cautious.)

A great deal. Which is why my cringe reflex still activates when I read about people running ZFS in places that aren't super tightly configured. ZFS is just such a massively complex piece of software.

There were legitimate bugs in ZFS that we hit. Mostly around ZIL/SLOG and L2ARC and the umpteen million knobs that one can tweak.

Customers blowing off their feet with ZFS because they felt the need to tweak tunables they didn’t need to use, or didn’t properly understand, is not the fault of ZFS though.

You can do the same with just about any file system. In the Windows world you can blow your feet off with NTFS configuration too.

Of course there have been bugs, but every filesystem has had data-impacting bugs. Redundancy and backups are a critical caveat for all file systems for a reason. I once heard it said that “you can always afford to lose the data you don’t have backed up”. I do not think that broadly applies (such as with individuals), but it certainly applies in most business contexts.

Yeah, my reaction is usually to how quickly and how frequently it's recommended for general use.

Obviously there's footguns in everything. Filesystem ones are just especially impactful.

> A great deal. Which is why my cringe reflex (...)

Can you provide some specifics? So far all I see is vague complaints with no substance, and when the complainers are lightly pressed they get defensive.

I don't have specifics for how many people running a fork of ZFS on Linux (or the fork for opensolaris, nexenta, etc) have copy-pasted some configuration from a wiki/forum/stackexchange and ended up with a pool that's misconfigured in some subtly fatal way. I don't have any personal anecdotes to share about my own homelab or enterprise IT experience with ZFS because I don't use it at home and nowhere I've worked in IT has used it.

I did live through specific situations over several years in a support engineer role where a double-digit percentage of customers in enterprise configurations ended up somewhere between terrible performance and catastrophic data loss due to misunderstood configuration of a very complex piece of software.

If you wanna use ZFS, use ZFS. I'm not the internet's crusader against it. I have no doubt there's thousands of PB out there of perfectly happy, well configured and healthy zpools. It has some truly next-gen features that are extremely useful. I've just seen it recommended so, so many times as a panacea when something simpler would be just as safe and long lasting.

It's kinda like using Kubernetes to run a few containers. Right?

Pool feature mismatch on send/receive, dedup send/receive, new features breaking randomly on bleeding-edge releases.
The intent of feature flags in ZFS is to denote changes in on-disk structures. Replication isn’t supported between pools that don’t support the same flags because otherwise ZFS couldn’t read the data from disk properly on the receiving side.

There are workarounds, with their respective caveats and warnings.
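
To make the feature-flag point a bit more concrete, here's a rough Python sketch of the kind of pre-flight check one might do before replicating. It assumes you've already collected each pool's `feature@<name>` states (disabled / enabled / active) into plain dicts; the function name and the example pools are made up for illustration. It's also a deliberate simplification: whether a real send/receive works depends on which features the stream actually uses, which this doesn't model.

```python
from typing import Dict, List

# States a pool reports for each `feature@<name>` property.
ACTIVE = "active"
ENABLED = "enabled"
DISABLED = "disabled"


def blocking_features(source: Dict[str, str], destination: Dict[str, str]) -> List[str]:
    """Hypothetical helper: list features that are active on the source pool
    but not enabled (or not known at all) on the destination pool."""
    blockers = []
    for name, state in source.items():
        if state != ACTIVE:
            # Features that are merely enabled have not changed the on-disk format yet.
            continue
        # A feature missing from the destination dict is treated as unsupported there.
        dest_state = destination.get(name, DISABLED)
        if dest_state not in (ENABLED, ACTIVE):
            blockers.append(name)
    return sorted(blockers)


# Made-up example pools, purely for illustration.
source_pool = {
    "lz4_compress": ACTIVE,
    "large_blocks": ACTIVE,
    "embedded_data": ENABLED,
}
destination_pool = {
    "lz4_compress": ENABLED,
    "large_blocks": DISABLED,
}

print(blocking_features(source_pool, destination_pool))
```

An empty list here only means "no obvious mismatch under these assumptions", not that the replication is guaranteed to work.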

> source: worked as a support engineer for a block storage company, witnessed hundreds of customers blowing one or both of their feet off with ZFS.

The phrasing of this leads me to believe that the customers set up ZFS in a 'strange' (?) way. Or was this a bug (or bugs) within ZFS itself?

Because when people talk about Btrfs issues, they are talking about the code itself and bugs that cause volumes to go AWOL and such.

(All file systems have foot-guns.)

Mostly customers thinking they fully understand the thousands of parameters in ZFS.

There was a _very_ nasty bug in the ZFS L2ARC that took out a few PB at a couple of large installations. This was back in 2012/2013 when multiple PB of storage was very expensive. It was a case of ZFS putting data from the ARC into the pool after the ZIL/SLOG had been flushed.

Can you give an example? To me it always looked like an NIH copycat filesystem.