Hacker News
by d33lio 1832 days ago
I'll believe it when I see it. Why anyone uses BTRFs (UnRaid or any other form of software raid that isn't ZFS) is still beyond me. At least when we're not talking SSDs ;)

ZFS is incredible, curious to mess around with these new features!

8 comments

BTRFS was useful for me. When those (RAID5) parity patches got rejected many, many years ago for non-technical reasons like not matching a business case/goal or similar, it changed my view of open source.

That was the day I realized that some open source participants and supporters are interested in having open source projects that are good enough to act as a barrier to entry, but not good enough to compete with their commercial offerings.

Judge the world from that perspective for a while and it can help to explain why so much open source feels 80% done and never gets the last 20% of the polish needed to make it great.

> (RAID5) parity patches got rejected many, many years ago

Ooooh. (Booo!)

I wouldn't mind a citation/mailinglist reference for this, if you have one. (I honestly have no idea what I'd Google.)

http://web.archive.org/web/20150301234243/blog.ronnyegner-co...

It's in the quotes from the offline (not on a mailing list) follow up, so it's all hearsay.

I should be clear, I don't necessarily mean I think the developers are complicit in that. I think what happens is more subtle. IE: Companies sponsor the project just enough to be the biggest open source product in the space, but not enough for the developers to make it great.

Or as an alternative conspiracy theory, companies sponsor the projects of great developers that build awesome core features, but never give enough support for someone to turn that into a marketable product. That way they can usurp the work for their own products.

I know there's a lot of speculation there, but, if you watch for it, you can see how most entrenched tech companies are really, really taking advantage of open source developers. Basically the people who are passionate and want to build great things are getting hugely ripped off by people with yachts and rockets.

Thanks very much for replying, and for the link!

One of the biggest differences I've noticed between open source and "the cathedral" is that commercial endeavors that revolve around productization tend (as a rule) to manifest sufficient runway to fully round out the implementation of an idea to the point the implementation can participate in the market cohesively by representing itself attractively/competitively. This is often a broad-spectrum effort that requires domain specialization across a huge number of skills, and the burden of sustaining cohesive focus is typically only viable in a commercial context; I think similar levels of adequate collective focus (many individuals, one goal) are only typically raised in cult-type contexts.

Besides the passion-project foundation you mentioned, a lot of open source seems to come into existence because an $employer needed a really specific thing one time and they let the developer license the code under GPL and here it is and there's the 2.8 pages of documentation and it's got some speling misteaks in it and hopefully it works. (...Woops, I just described NPM, and some percentage of PyPI.)

Very very problematically, there's no collective language in the FOSS scene to distinguish between passion projects and commercially-driven JIT-developed code-dumps. After all, the code probably has just as many bugs per 1,000 lines, and the different contexts produce results that work the same, so...?

IMO, being able to encode that attribution to our communication would make SO MUCH difference in terms of user support, project coordination, etc! Coming from a perspective that's still optimistic :), arguing that "this is great, but it doesn't fit our business use case" translates for me to "my contract doesn't extend to me implementing/grokking/mentally integrating/testing/maintaining this new code, and it's not interesting enough for me to figure it out out of hours either" - so the invitation really is there, "send patches in if they're important enough to you", but it requires working cue perception (and possibly lack of cynicism) in all readers in order to be interpreted correctly. IF this is in fact the message that was being sent (!).

That the followup work was not done does indeed waste the effort made by the patch author, and is generally arguably stupid. But this to me brings up questions about the patch author's motivations, and why they didn't have a go at hammering everything into place - because, assuming fully adequate motivation/stamina and sufficient free time to iterate on the patch until the mailinglist likes it, eventually you'll reach a point where either the patch is in a staging tree somewhere, or the list has exploded into a flamewar about why the patch hasn't been accepted already, at which point (continuing to assume ideal circumstances) the patch author could go follow up on all the raised points.

I guess the outcome depends on whether the patch author considers the above AbSoLuTeLy ToO MuCh WoRk SeRiOuSlY ArE YoU KiDdInG Me, or welcomes the community involvement/participation/feedback and does their best to negotiate it to the point of getting the code merged. That the patch author didn't do this is something only they can provide extra context and judgement about; as you noted about speculation, I could only come up with uncited hypotheses here.

Looping back to the first paragraph, there are indeed many instances, for example in the audio/image/video editing scene, where software availability for Linux is incredibly restricted compared to Windows. The options are there, except they don't really work, or they fall over really quickly, or they feel really clunky. I think this sadly comes down to market demand. For example I've tried poking around with simple audio editing tasks - literally just loading a couple of tracks and crossfading them together - on Linux, and come up blank. Pop culture saturation might also be a factor, for example Renderman and Maya have been around for Linux for ages, but it's sliiightly (I suspect) easier to find "how to <jmp> over the license check" thingys for After Effects or Photoshop, for example.

However, with all of this being said, I have noticed a few industries where the comparison of feature parity in what's available in open source vs what's available commercially is sufficiently vast it directly leads to actual mental disorientation ("wait, this is where things are really at??"). It's incredibly difficult in this situation to stay non-cynical and not draw the types of conclusions you allude to in these settings (that perhaps there are agreements in place to not implement certain features, for example). And it's kind of interesting how "get Photoshop" is kind of a thing - an obscure thing, but still a thing that happens, while this seems (generally speaking) to happen less to Eyewateringly-Expensive Software™ targeted at Linux (?).

I guess this reply was me doing the (googles) denial-anger-bargaining-depression-acceptance thing while reasoning through your comment and trying to not be cynical :D, haha. It certainly is one of those worldview classification grey areas, where it's almost like it can be both things at once (except it can't, because that wouldn't make sense)...

I do also definitely agree that a lot of open-source development work is unfairly leveraged, and additionally that the "learning to code" movement (yay, more labor externalization!) is overhyped, almost to the point of the cult thing I noted above.

For my big media volume, which has existed for around 10 years, I use snapraid.

Because of several things:

* I can mix disk sizes

* I can add new disks over time as needed

* If something dies, up to the entire server, I can just stick any data disk in another system and read it
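
A rough Python sketch of the mixed-size point, for anyone who wants to play with numbers. It assumes SnapRAID's rule that each parity disk must be at least as large as the largest data disk; the function name and the disk sizes are made up for illustration, and filesystem overhead is ignored.

```python
def snapraid_summary(data_tb, parity_tb):
    """Summarise a mixed-size array: usable space and whether parity is big enough."""
    largest_data = max(data_tb)
    return {
        # Each data disk keeps its own filesystem, so usable space is a plain sum.
        "usable_tb": sum(data_tb),
        # Assumed rule: every parity disk must be at least as large
        # as the largest data disk.
        "parity_ok": all(p >= largest_data for p in parity_tb),
    }


if __name__ == "__main__":
    # Hypothetical sizes: three mismatched data disks and one parity disk.
    print(snapraid_summary([4, 6, 8], [8]))
    # Adding a 12 TB data disk would call for a parity disk of at least 12 TB.
    print(snapraid_summary([4, 6, 8, 12], [8]))
```

Under that assumption, growing the array with ever larger disks means the newest, largest disk usually has to take over the parity role.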

I didn't want to become a zfs expert (and the learning curve seems steep!), and I didn't want to spend thousands of dollars on new gear (dedicated NAS box and a bunch of matched-size disks).

I repurposed my old workstation into a server, spent a few hours getting it set up, and it works. I've had two disks fail (one data, one parity, and recovered from both). Every time I've added a new disk, it's been 50-100% larger than my existing disks.

I've also migrated the entire setup to a new system (newer old retired workstation), running proxmox, and was pleasantly surprised it only took about an hour to get that volume back up (incidentally, that server runs zfs as well.. I just don't use it for my large media storage volume).

UnRaid and Synology user here and I completely agree with all your points. The knowledge that at worst I will lose the data on just 1 disk (or 2 if I fail during a rebuild) is very calming. If not for UnRaid there is no way I could manage the size of the media volume I maintain (from a time, energy, and money perspective). I mean if you know ZFS well and trust yourself then more power to you but UnRaid and friends fill a real gap.
The funny thing here is that I went with zfs lately because it saved me money. As a poor student trying to maximise GB/$, raid6/lk was the balance point between capacity and redundancy.
The learning curve of ZFS compared to every alternative out there is significantly lower IMO. The interface is easier and the guides online are great.

There are drawbacks, such as the one discussed here, but as a Linux user who doesn’t want to mess with the FS and uses ZFS for the backup server, the experience has been great so far.

I just put two 8TB drives into btrfs because it's a home server, I can't provision things up front. One day I may put a third 8TB drive and turn this RAID1 into RAID5. btrfs lets me do that, zfs doesn't, simple as.

One day I may switch the whole thing to bcachefs, which I've donated to and am looking forward to. For the moment, btrfs will have to do.

EDIT: downvoted by... the filesystem brigade?

I disagree with this statement on multiple fronts. On the first level of the onion, RAID1 is for high reliability, and BTRFS has historically had low reliability. But if you peel back that layer, you are presenting the ability to transition from BTRFS RAID1 to RAID5 as an appealing feature of BTRFS vs ZFS, and yet this just isn't so.

BTRFS has been promising usable RAID5 since 2009, when it was "heading for 1.0", and yet among the most recent developments, not 3 months ago, was the addition of the following warning to btrfs-progs on creation or conversion.

"RAID5/6 support has known problems is strongly discouraged to be used besides testing or evaluation,"

Worse, this feature was presented as usable around 2011/12 before being revealed in 2016 to be unfixably data-eating without substantial rewrites, and 5 years later it remains so.

Your hardware might need to be replaced before you can avail yourself of the benefit you posit.

Meanwhile an approach that would actually work on both BTRFS and ZFS would be to add 2 drives to go from RAID1 to RAID10.
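
To put numbers on the layouts being argued about, here is a small Python sketch of usable capacity. It uses textbook formulas for equal-size disks and ignores metadata overhead; the function name is hypothetical and nothing here comes from btrfs or ZFS tooling.

```python
def usable_tb(level: str, disks: int, size_tb: float) -> float:
    """Approximate usable capacity for equal-size disks (overhead ignored)."""
    if level == "raid1":
        # Two copies of every block: half of the raw space is usable.
        return disks * size_tb / 2
    if level == "raid5":
        # One disk's worth of space is spent on parity.
        return (disks - 1) * size_tb
    if level == "raid10":
        # Striped mirrors: half of the raw space is usable.
        return disks * size_tb / 2
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    print(usable_tb("raid1", 2, 8))   # the current two-disk mirror
    print(usable_tb("raid5", 3, 8))   # one more disk, converted to RAID5
    print(usable_tb("raid10", 4, 8))  # two more disks, converted to RAID10
```

With 8TB disks, the mirror gives about 8TB, and both upgrade paths land at about 16TB; the difference is one extra disk versus relying on RAID5.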

The last peel of onion is complaining about down votes. This invites more down votes. If I had to guess people down voted you because you presented a feature that has been a massive pain point for BTRFS as a proposed advantage.

>RAID5

I wish you lots of fun with that on btrfs :)

Edit:

https://btrfs.wiki.kernel.org/index.php/Status

RAID56 Unstable n/a write hole still exists

> treated as if I'm storing business data or precious memories without backups, guess I'm just dumb

No, you're not, but don't use unstable features in a filesystem

Well, that's the idea! This is a low I/O media server where all the important stuff (<5G of photos) has 2+ redundancy, once remotely, and on every workstation I sync, with the rest of the data being able to crash and burn without much repercussion.

The whole point of me using RAID1 (and maybe later RAID5) is that if a disk goes bust, odds are I can still watch a movie from it until I can get another disk. What's more, if I ever fill the RAID1 and I don't feel like breaking the piggy bank for another disk, I can go JBOD as far as my usecase is concerned.

But hey, if the orange website tells me all servers are supposed to be treated as if I'm storing business data or precious memories without backups, guess I'm just dumb. On that note: donations welcome, each 8TB disk costs close to 500 USD here in Uruguay, so if anyone's first world opinion can buy me a couple so I can use the Right Filesystem™, I'd appreciate it!

Look, I really don't care what FS you use, just don't use unstable features. If you pay so much for your disks JUST for your movies, it seems they have some value for you.

>so if anyone's first world opinion can buy

Oh boohoo, says the guy who can afford a NAS for his movies and >2 workstations. Stop with your wannabe victim role.

At 500 a disk you could fly a consultant out to demonstrate a high capacity storage solution consisting of many $200 disks who just happens to forget to bring the disks back!
There is a large group of people who really dislike BTRFS. I think they were probably burned by it at some point, but I’ve never had trouble and I’ve been using it since it became the default on Fedora.
btrfs does have some advantages over zfs

   - no data duplicated between page cache and arc
   - no upgrade problems on rolling distros
   - balance allows restructuring the array
   - offline dedup, no need for huge dedup tables
   - ability to turn off checksumming for specific files
   - O_DIRECT support
   - reflink copy
   - fiemap
   - easy to resize
>- ability to turn off checksumming for specific files

This is something that has always confused me. BTRFS users are always advised to disable copy-on-write (thus preventing checksumming or compression) for VM images or database files to avoid massive performance hits from fragmentation. Even Facebook still stores its databases mostly on XFS filesystems. However, the ZFS community seems to indicate that you can achieve reasonable performance for databases and VMs just by tuning the recordsize (e.g. https://pg.uptrace.dev/zfs/). How does ZFS mitigate the problems from fragmentation?

But the main thing of an fs is to preserve your files...btrfs can't even check the most important point.
The checksumming helps to spot faulty hardware; that's a step above most other filesystems, and often above SMART info too.
Checksums don't help against bugs. You are much less likely to lose your whole disk with ext4 or ZFS than BTRFS.
I see this a lot but have never had problems with BTRFS and I’ve used it both on my larger disks (2+tb) and my root (250gb ssd) across multiple computers for the last four years.
And even included in the kernel

    - defragmentation
Some of these are fair points but zfsonlinux/OpenZFS has had O_DIRECT since 0.8.x.
ZFSOnLinux just ignores the O_DIRECT flag if I remember correctly. Granted, this is what btrfs should do by default as well since there is an ugly issue where software can modify the O_DIRECT buffer after it was submitted causing btrfs checksum errors even though nothing was corrupted (and there is nothing to be done about it except disabling O_DIRECT or creating a buffer copy).
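
The race described here can be imitated in plain Python without any real O_DIRECT I/O. This is only a simulation under stated assumptions: `zlib.crc32` stands in for the filesystem's checksum, and `submit_write` and `verify` are hypothetical helpers, not btrfs or ZFS internals.

```python
import zlib


def submit_write(buf: bytearray, mutate_in_flight: bool, copy_first: bool):
    """Simulate a checksummed write; returns (stored_data, stored_checksum)."""
    # With copy_first, the "filesystem" works on a private snapshot of the buffer.
    source = bytes(buf) if copy_first else buf
    checksum = zlib.crc32(source)  # checksum computed at submission time
    if mutate_in_flight:
        buf[0] ^= 0xFF  # the application changes its buffer before the data lands
    stored = bytes(source)  # the data that actually reaches the "disk"
    return stored, checksum


def verify(stored: bytes, checksum: int) -> bool:
    """Re-check stored data against its checksum, as a later read or scrub would."""
    return zlib.crc32(stored) == checksum


if __name__ == "__main__":
    data = b"hello world"
    print(verify(*submit_write(bytearray(data), False, False)))  # untouched buffer
    print(verify(*submit_write(bytearray(data), True, False)))   # mutated in flight
    print(verify(*submit_write(bytearray(data), True, True)))    # mutated, but copied first
```

Only the middle case fails verification: the data on "disk" is whatever the application last wrote into the buffer, while the checksum describes the earlier contents, so nothing is really corrupted. Copying the buffer first avoids the mismatch at the cost of the extra copy that O_DIRECT is meant to skip.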
Yes (and as you mention this is the right thing to do) but I think it means that the Linux kernel will bypass the page cache which is still useful.
> why anyone uses BTRFs (UnRaid or any other form of software raid that isn't ZFS) is still beyond me.

BTRFS can do after-the-fact deduplication (with much better performance than ZFS dedup) and copy-on-write files. And you can turn snapshots into editable file systems.

I've had 3 catastrophic BTRFS failures. In two cases, the root filesystem just ran out of space and there was no way to repair the partition. Last time, the partition was just rendered unmountable after a reboot. All data was lost. No such thing has ever happened with ZFS for me.
A recent Fedora install here came with a new default of BTRFS use rather than ext4. So I'm curious about your experience, were any of those catastrophic failures recent? Do you know of any patches entering the kernel that purport to fix the issues you experienced?
Last one was two years ago. I was told that it was a hardware issue. Same SSD is still going strong with ext4 now.
I've had some annoying failures too. But I wasn't listing pros and cons, I was explaining that there are some very notable features that ZFS lacks.
That's fair. However, when listing notable features for the sake of comparing software, I think it's important to also list other characteristics of a given piece of software. If we were to compare software by feature sets alone, one might argue that Windows has the most features, so Windows must be the best OS.
I think cloning a zfs snapshot into a writeable filesystem matches at least the functionality of btrfs writeable snapshots, but I could be ignorant about some use-cases.
Let's say you want to clear out part of a snapshot of /home, but keep the rest.

So you clone it and delete some files. All good so far, but the snapshot is still wasting space and needs to be deleted.

But to make this happen, your clone has to stop being copy-on-write. All the data that exists in both /home and the clone will now be duplicated.

And you could say "plan ahead more", but even if you split up your drive into many filesystems, now you have the problem that you can't move files between these different directories without making extra copies.

To put it in other words, zfs doesn't support rebasing a clone off of a newer snapshot. Otherwise e.g. you could create a clone, snapshot each clone, create two new clones and promote them, and then delete the original snapshot. But what zfs does is re-attach the original snapshot to the promoted clone of the original volume, and it is still the referred base of the other clone.
I’m a beginner in ZFS, but copying the modified clone and then destroying the clone and the snapshot would solve your problem, wouldn’t it?
Licensing. Similarly, otherwise it would've been included in macOS a long time ago (as the default fs according to some..)
The reason it didn’t end up in macOS is because NetApp sued Sun for patent infringement. Apple wanted nothing to do with that lawsuit and quickly abandoned the project.

As others have stated, dtrace has the exact same license and has been in MacOS for years.

The licensing has nothing to do with it on OSX - indeed DTrace (also under the CDDL) has been shipping in it for years.
And it's arguably even a bigger issue on Linux distros.
It’s a moderate pain on Linux and then only really that if you’re running on something bleeding-edge like Arch. Otherwise it’s just a kernel module like any other.
But it doesn't ship with either Red Hat or SUSE distros, which is an issue for supported commercial use.
What's Oracle's play here? Do they somehow make money out of ZFS, which makes them reluctant to re-license it?
Is there a CLA for OpenZFS/ZoL? I don't believe there is, so I don't think Oracle can unilaterally relicense it.
Even if there were a CLA for OpenZFS, it wouldn't affect Oracle's inability to relicense the whole thing.

They could relicense their codebase, of course, but the number of changes that have happened since they diverged is not small.

A CLA and copyright assignment was how Oracle were able to make (now Oracle) ZFS proprietary again in the first place. As you say though, OpenZFS and Oracle ZFS have diverged quite a bit, and most of the world is now based around the OpenZFS variant that acts as the upstream for Linux, FreeBSD and even Windows variants.
I do believe that the license was fine for macOS but when Oracle bought Sun that killed it cold.

Jobs never liked anybody other than himself holding all the cards. Having Ellison and Oracle holding the keys to ZFS was just never going to fly.

It's a combination of the license and the fact that it's Oracle, of all entities, that owns the copyright. Perhaps either one by itself wouldn't be a dealbreaker but the combination is. And, of course, Oracle could have changed the license at any time after buying Sun.

(Of course, Jobs may have just decided he didn't want to depend on someone else for the MacOS filesystem in any case.)

ADDED: And as others noted, there were also some storage patent-related issues with Sun. So just a lot of potential complications.

That makes absolutely no sense. Jobs and Ellison were best friends. Oracle acquiring Sun would have made it MORE attractive, not less.

https://www.cnet.com/news/larry-ellison-talks-about-his-best...

I had ZFS on a Mac from Apple for a short amount of time during one of the betas :( I think TimeMachine was going to be based on it but they pulled out.
FYI there is a third-party effort for making OpenZFS usable on macOS.

https://openzfsonosx.org/

I used it for a while, but unfortunately, since there are not many people working on this and they are not working on it full time, it can take them a good while from when a new version of macOS is released until OpenZFS is usable with that version. This was certainly the case a while ago, and it is why I stopped using OpenZFS on macOS and went back to using ZFS only on FreeBSD and Linux. So with my Mac computers I only use APFS.

Jobs and Ellison were really close friends
And also cold hearted clear eyed businessmen unlikely to allow friendship to affect their corporations.

I’d love to be a fly on the wall for some of those conversations.

Simplicity. There's a lot of complexity in ZFS I'd rather not depend on, and because it does so many things it's a big investment and liability to switch to.

While I understand why it would be useful in a corporate setting, for personal use I've found the combination of LUKS+LVM+SnapRAID to work well and don't see the benefit of switching to ZFS. Two of those are core Linux features, and SnapRAID has been rock solid, though thankfully I haven't tested its recovery process, but it seems straightforward from the documentation. Sure I don't have the real-time error correction of ZFS and other fancy features, but most of those aren't requirements for a personal NAS.

What about if you were just starting today, with 0 knowledge about basically anything related to storage and how to do it right?

That's my case, I'm learning before setting up a cheap home lab and a NAS, and I'm wondering if biting into ZFS is just the best option that I have given today's ecosystem.

I was in the same place 6 or 7 years ago. Due to indecision, I ended up using btrfs, zfs, and mdadm (technically, Synology hybrid raid) on various devices. They all work, more or less.

Looking back, the lessons that come to mind are:

- Always have 2 backups (not counting the primary copy), at least 1 "cold" (inaccessible without human intervention) and at least 1 offsite. Backup frequently and retain old backups. With backups, bad decisions are reversible.

- With btrfs or zfs, using a collection of 2-disk mirrors was useful because it provided flexibility (to expand the array, just add another pair of disks) and seemed to have better performance than a single disk. Try to pair disks from different manufacturing batches though. I saw two disks from the same batch and _used in the same mirror_ fail in the same month, which was disconcerting.

- The only data corruption I had to deal with was from RAM that started off good and went bad after a couple years.

- Standardizing on btrfs or zfs from the beginning would have allowed backup by sending snapshots, which would have been a lot easier than cobbling together a solution using rsync.

- Scrub on a regular schedule. Set up monitoring software to notify you of the outcome of each scrub and of any SMART errors.
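
The backup rule in the first bullet is easy to turn into a sanity check. A minimal Python sketch, assuming each backup can be described by two flags; the `Backup` class and `unmet_rules` function are hypothetical names, not part of any backup tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Backup:
    name: str
    cold: bool     # inaccessible without human intervention
    offsite: bool  # stored at a different physical location


def unmet_rules(backups: list[Backup]) -> list[str]:
    """Return the rules from the first bullet that the given backups fail."""
    problems = []
    if len(backups) < 2:
        problems.append("need at least 2 backups besides the primary copy")
    if not any(b.cold for b in backups):
        problems.append("need at least 1 cold backup")
    if not any(b.offsite for b in backups):
        problems.append("need at least 1 offsite backup")
    return problems


if __name__ == "__main__":
    plan = [
        Backup("usb-disk-in-drawer", cold=True, offsite=False),
        Backup("nas-at-parents", cold=False, offsite=True),
    ]
    print(unmet_rules(plan))      # an empty list means all three rules are met
    print(unmet_rules(plan[:1]))  # a single local cold copy fails two rules
```

It does not check backup frequency or retention, which the bullet also asks for.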

Thank you. I need to start small, otherwise I feel overwhelmed by too many moving pieces to keep in mind and plan for.

So I'm starting small, from powering up a ThinkCentre M910 I had laying around, with an internal disk that can be used to store backups. I have 0 need for performance so my idea was to extend storage with an external USB3 HD enclosure. For now, I don't have the space nor the machine where to install dual hard disks for building a decent RAID. Time will tell.

> That's my case, I'm learning before setting up a cheap home lab and a NAS, and I'm wondering if biting into ZFS is just the best option that I have given today's ecosystem.

ZFS is the simplest stack that you can learn IMHO. But if you want to learn all the moving parts of an operating system for (e.g.) professional development, then more complex may be more useful.

If you want to create a mirrored pair of disks in ZFS, you do: sudo zpool create mydata mirror /dev/sda /dev/sdb

In the old school fashion, you first partition with gdisk, then you use mdadm to create the mirroring, then (optionally) LVM to create volume management, then mkfs.
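
For a side-by-side view, here is a Python sketch that only builds and prints the command lines and never runs them. Apart from the `zpool create` line quoted above, the device, array, and volume names (`/dev/md0`, `vg0`, `data`) and the choice of ext4 are assumptions for illustration; all of these commands destroy data on the named disks, so treat the output as reading material rather than something to paste.

```python
import shlex

# Hypothetical device names, for illustration only.
DISKS = ["/dev/sda", "/dev/sdb"]

zfs_steps = [
    ["zpool", "create", "mydata", "mirror", *DISKS],
]

layered_steps = [
    ["gdisk", DISKS[0]],  # interactive: create one partition on each disk
    ["gdisk", DISKS[1]],
    ["mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
     "/dev/sda1", "/dev/sdb1"],
    ["pvcreate", "/dev/md0"],  # the optional LVM layer starts here
    ["vgcreate", "vg0", "/dev/md0"],
    ["lvcreate", "-n", "data", "-l", "100%FREE", "vg0"],
    ["mkfs.ext4", "/dev/vg0/data"],
]


def render(steps):
    """Turn argument lists into shell-style command lines without running them."""
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for title, steps in (("ZFS", zfs_steps), ("mdadm + LVM + mkfs", layered_steps)):
        print(f"{title}: {len(steps)} step(s)")
        for line in render(steps):
            print("  " + line)
```

The step count is a crude measure, but it shows how many separate tools are involved in each approach.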

I dove into ZFS for my home lab as a relative novice.

It's not terrible, but there are a few new concepts to come to grips with. Once you have them down, it's not terrible.

If you don't plan on raiding, IMO, ZFS is overkill. The check-summing is nice, but you can get that from other filesystems.

Maintenance is fairly straightforward. I've even done a disk swap without too much fuss.

The biggest issue I had was setting up raid z on root with ubuntu was a PITA (at the time at least, March of this year). I ended up switching over to debian instead. Once setup, things have been pretty smooth.

Three things I like about it, as per what I've read so far:

* Checksumming

* As you mention, easy maintenance

* Snapshots and how useful they are for backups

In the end what I value is stuff that works reliably, doesn't get in the way, and requires minimal supervision. And in the particular case of FS, I'd like to adopt a system that helps avoid bitrot in my data.

Could you drop some names that you would consider good alternatives to ZFS?

For close to ZFS feature parity but much younger, BTRFS.

Otherwise it's sort of figuring out what features you want to drop. XFS and ext4 are probably where I'd look for a single disk hard drive.

Like I said, you could do ZFS, but definitely feels a bit like overkill. Setting up a vdev with one disk just to get snapshots and checksums seems like a lot.

I would still go with a collection of composable tools rather than something as monolithic as ZFS, and to avoid the learning curve. But again, for personal use. If you're planning to use ZFS in a professional setting it might be good to experiment with it at home.
As mentioned in the sibling comment, one thing I like is having systems that don't require me to supervise, fix things, etc. In part that's why I've always been a user of ext4, it just works.

But I've recently found bitrot in some of my data files and now that I happened to be learning about how to build a NAS, I wanted to make the jump to some FS that helps me with that task.

Could you mention which tools you would use to replace ZFS? Think of checksumming, snapshotting, and to a lesser degree, replication/RAID.

I would argue that a collection of mostly composable tools can easily be much more complex (and bug-prone!) than a single “monolith”. Fewer moving parts can be good sometimes, and I would argue that file system/volume management is a very compact problem domain where better integration between the tools is more important than extensibility.
> LUKS+LVM+SnapRAID

+ your fs

Yeah that sounds like a lot less complexity

ZFS has all of these features and more. If I don't need those extra features by definition it's a less complex system.

Using composable tools is also better from a maintenance standpoint. If tomorrow SnapRAID stops working, I can replace just that component with something else without affecting the rest of the system.

> If tomorrow SnapRAID stops working, I can replace just that component with something else without affecting the rest of the system.

Can you actually? If some layer of that storage stack stops working then you can no longer access your existing data, because all these layers need to work correctly to correctly reassemble the data read from disk.

It's a hypothetical scenario :) In reality if there's a project shutdown there would be enough time to migrate to a different setup. Of course it would be annoying to do, but at least it's possible. With a system like ZFS I'm risking having to change the filesystem, volume manager, storage array, encryption and whatever other feature I depended on. It's a lot to buy into.
Since all those tools are from different devs, the system gets more complex. But hey, if you really think that ZFS is too complex to hold 55 petabytes because it has too many potential bugs, you should tell them:

https://computing.llnl.gov/projects/zfs-lustre

Thankfully I don't have to manage 55 petabytes of data, but good luck to them.

Did you miss the part where I mentioned "for personal use"?

> Since all those tools are from different dev's the system gets more complex.

I fail to see the connection there. Whether software is developed by a single entity or multiple developers has no relation to how complex the end user system will be.

But many small tools focused on just the functionality I need allows me to build a simpler system overall.

>Did you miss the part where I mentioned "for personal use"?

Since ZFS is simpler to use than your setup and has been used to store 55PB of data without a single bit error since 2012, I don't see why someone should use inferior stuff, even when it's "personal use".

>But many small tools focused on just the functionality I need allows me to build a simpler system overall.

Sometimes monoliths are better, for example the network stack and storage... maybe kernels (big maybe here).

> Whether software is developed by a single entity or multiple developers has no relation to how complex the end user system will be.

The first part of this sentence is probably true, as far as I see, but the complexity of a system perceived by the user depends primarily on the "surface" of the system. That surface includes the UI, the documentation and important concepts you have to understand for effective usage of the system. And in that regard, ZFS wins hands down against LUKS + LVM + SnapRaid + your FS of choice. Some questions a user of that LVM stack has to answer, aren't even asked of a ZFS user. E.g. the question how to split the space between volumes or how to change the size of volumes.

RAM?

Every time I looked into setting up a freenas box, every hardware guide insisted that ungodly amounts of absolutely-has-to-be-ECC RAM were essential, and I just gave up at that point.

The "you need at least 32GB of memory and it has to be ECC, or don't even bother trying to use ZFS" crowd has done some serious harm to ZFS adoption. Sure, that's what you need if you want excellent data integrity guarantees and to use all of ZFS' advanced features. If you're fine with merely way-better-than-most-other-filesystems data integrity guarantees and using only most of ZFS' advanced features, you don't need those.
I really don't know where the "You gotta have ECC RAM!" thing started. I've been running a ZFS RAID on Nvidia Jetson Nanos for years now and haven't had any issues at all with data integrity.

I don't see why ZFS would be more prone to data integrity issues spawning from a lack of ECC than any other filesystem.

Relevant quote from one of ZFS's primary designers, Matt Ahrens: “There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. ... I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.”
Yeah, I remember reading that a few years ago.

If I were running a server farm or something, then yeah, I'd probably use ECC memory, but I think if you're running a home server, then the argument that ZFS necessitates ECC more than Ext4 or Btrfs or XFS or whatever doesn't really seem to be accurate.

> the argument that ZFS necessitates ECC more than Ext4 or Btrfs or XFS or whatever doesn't really seem to be accurate

Agreed.

> If I were running a server farm or something, then yeah, I'd probably use ECC memory, but I think if you're running a home server

Then you should still use ECC RAM, regardless of what filesystem you're using.

No, really. ECC matters (https://news.ycombinator.com/item?id=25622322) generally.

Years ago I saw it at:

https://www.truenas.com/community/threads/ecc-vs-non-ecc-ram...

(the gist of the scary story is that faulty ram while scrubbing might kill "everything".) However, in the end ECC appears to NOT be so important, e.g., see

https://news.ycombinator.com/item?id=23687895

There is literally only one feature that uses massive amounts of memory. Online deduplication relies on keeping an in-RAM table of duplicated blocks. This means that the more duplication you have, the larger the table is.

FreeBSD Mastery: ZFS by Michael Lucas around pg 174

Deduplication Memory Needs
==========================

"For a rough-and-dirty approximation, you can assume that 1 TB of deduplicated data uses about 5 GB of RAM. You can more closely approximate memory needs for your particular data by looking at your data pool and doing some math. We recommend always doing the math and computing how much RAM your data needs, then using the most pessimistic result. If the math gives you a number above 5 GB, use your math. If not, assume 5 GB per terabyte."

https://www.tiltedwindmillpress.com/?product=fmzfs

This is not to say you need 5GB for every 1TB of data. It doesn't even mean you need 5GB of RAM for every 1TB on which you have enabled dedup. It means you need approximately 5GB of RAM for each TB of data that is both duplicated and residing on a dataset for which you have enabled dedup. Because the memory cost of dedup is high and rises in proportion to its utility, it's only useful in cases where you can plan ahead for its requirements. 99% of users are unlikely to use dedup; however, this doesn't stop some (not you, obviously) from promoting the idea that ZFS requires 5GB of memory per TB or some similarly absurd figure.
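The "doing the math" step from the book quote can be sketched roughly like this. The 320 bytes per dedup-table entry and the 64 KiB average block size are my assumptions for a back-of-the-envelope estimate, not numbers taken from the quote; with those assumptions the result happens to land on the 5 GB per TB rule of thumb.

```python
# Rough estimate of the RAM needed to hold a ZFS dedup table (DDT).
# ASSUMPTIONS (not from the thread): ~320 bytes per DDT entry and a
# 64 KiB average block size. Check your own pool for real numbers.

BYTES_PER_DDT_ENTRY = 320        # assumed in-memory size of one entry
AVG_BLOCK_SIZE = 64 * 1024       # assumed average block size in bytes
TIB = 1024 ** 4
GIB = 1024 ** 3


def ddt_ram_bytes(data_bytes, avg_block_size=AVG_BLOCK_SIZE,
                  entry_size=BYTES_PER_DDT_ENTRY):
    """Return the estimated DDT size in bytes for the given amount of data."""
    # Worst case: every block is unique, so each block gets its own entry.
    blocks = data_bytes // avg_block_size
    return blocks * entry_size


if __name__ == "__main__":
    estimate = ddt_ram_bytes(1 * TIB)
    print(f"1 TiB of data -> about {estimate / GIB:.1f} GiB of RAM")
```

Smaller average block sizes push the estimate up quickly, which is why the quote says to use the most pessimistic result.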

As an aside, I really liked the book. I found it easy to read and understand, and very informative. Despite being focused on FreeBSD, it's mostly applicable to Linux as well.

Heh so you have that backwards. All RAM should be ECC if you care about what’s stored in it. It’s not a ZFS requirement, it’s just that ZFS specifically cares about data integrity so it advises you to use ECC RAM. But it’s not like any other file system is immune from random RAM corruption: it’s not, it just won’t tell you about it.
Neither quantity nor ECC is essential.

ZFS defaults to assuming it is the primary reason for your box to exist, but it only takes two lines to define more reasonable RAM usage: zfs_arc_min and zfs_arc_max. On a NAS type server, I would think setting the max to half of your RAM is reasonable. Maybe 3/4 if you never do anything except storage.
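A minimal sketch of those two lines, assuming OpenZFS on Linux, where `zfs_arc_min` and `zfs_arc_max` are module parameters given in bytes and usually set in a modprobe config file such as /etc/modprobe.d/zfs.conf. The helper name, the 1 GiB floor and the 16 GiB example box are made up for illustration.

```python
# Build the two modprobe lines that bound the ZFS ARC.
# ASSUMPTION: OpenZFS on Linux; zfs_arc_min / zfs_arc_max take byte values.
# The helper name and the default floor are hypothetical, for illustration.

GIB = 1024 ** 3


def arc_option_lines(total_ram_bytes, max_fraction=0.5, min_bytes=1 * GIB):
    """Return the 'options zfs ...' lines capping the ARC at a fraction of RAM."""
    arc_max = int(total_ram_bytes * max_fraction)
    arc_min = min(min_bytes, arc_max)      # the floor must not exceed the cap
    return [
        f"options zfs zfs_arc_min={arc_min}",
        f"options zfs zfs_arc_max={arc_max}",
    ]


if __name__ == "__main__":
    # Example: a 16 GiB NAS with the ARC capped at half of RAM.
    for line in arc_option_lines(16 * GIB):
        print(line)
```

Passing `max_fraction=0.75` gives the "nothing but storage" variant.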

ECC is not recommended because ZFS has some kind of special vulnerability without it; ECC is recommended because ZFS has taken care of all the more likely chances of undetectable corruption, so that's the next step.

It is not that simple regarding ECC. Since ZFS uses more memory, the probability of hitting a memory bug is simply higher with it.
But it doesn’t really use more memory. The ARC gives the impression of high memory usage because it is separate from the OS page cache, so many monitoring tools report it explicitly instead of ignoring it the way they ignore the OS cache. Linux, without ZFS, will happily consume nearly all RAM with any filesystem if enough data is read and written.
This is correct. Any filesystem using the kernel's filesystem cache will do this, too.

For a long running, non-idle system, a good rule of thumb is that all RAM not being actively used is being used by evictable caching.

A colleague who was used to other UNIXes was transitioning to Linux for a database. He saw in free that used was at more than 90%, so he added more RAM. But to his surprise it was still using 90%! He kept adding RAM. I told him that he had to subtract the buffer and cached values (this was before free had the Available column).
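The subtraction looks roughly like this. It is only a sketch: it assumes input in the Linux /proc/meminfo format, and the sample numbers are invented to show the effect.

```python
# Recompute "used" memory by subtracting buffers and page cache from the
# naive figure. ASSUMPTION: input looks like Linux /proc/meminfo
# ("Key:   value kB"); the sample numbers below are made up.

SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         1000000 kB
Buffers:          500000 kB
Cached:         10500000 kB
"""


def parse_meminfo(text):
    """Return a dict mapping field names to integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return fields


def used_percent(info, subtract_cache):
    """Percentage of RAM in use, optionally ignoring buffers and page cache."""
    used = info["MemTotal"] - info["MemFree"]
    if subtract_cache:
        used -= info["Buffers"] + info["Cached"]
    return 100.0 * used / info["MemTotal"]


if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"naive used:          {used_percent(info, False):.2f}%")
    print(f"minus buffers/cache: {used_percent(info, True):.2f}%")
```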
ZFS likes RAM and uses it to get better performance (and don't think about using dedup without huge ram), but you don't need it and can change the defaults.

ECC tends to attract zealots after a perfect error-free existence which ECC does tend towards but doesn't deliver, it just reduces errors. I personally don't care about a tiny amount of bit rot (zfs will prevent most of this) and rebooting my storage machine now and then.

You can run ZFS/freenas on a crappy old machine and you'll be just fine as long as you aren't hosting storage for dozens of people and you aren't a digital archivist trying to keep everything for centuries.

Real advice:

* Mirrored vdevs perform way better than raidz; I don't think the storage gain from raidz is worth it until you have dozens of drives

* Dedup isn't worth it

* Enable lz4 compression everywhere

* Have a hot spare

* You can increase performance by adding a vdev set and by adding RAM

* Use drives with the same capacity
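To put numbers on the mirror vs raidz storage gain, here is a quick capacity comparison. It is a sketch under simplifying assumptions: equal-size drives, 2-way mirrors, a single raidz2 vdev, and no metadata or padding overhead. The drive counts and sizes are just examples.

```python
# Compare raw usable capacity of striped mirrors against one raidz vdev.
# ASSUMPTIONS: equal-size drives, no metadata/padding overhead; the
# layouts and drive counts below are illustrative only.


def mirror_usable(drives, drive_tb, way=2):
    """Usable TB for a pool built from `way`-way mirror vdevs."""
    return (drives // way) * drive_tb


def raidz_usable(drives, drive_tb, parity=1):
    """Usable TB for a single raidz vdev with `parity` parity drives."""
    return (drives - parity) * drive_tb


if __name__ == "__main__":
    for n in (4, 6, 12):
        m = mirror_usable(n, 4)
        z = raidz_usable(n, 4, parity=2)
        print(f"{n} x 4 TB drives: mirrors {m} TB, raidz2 {z} TB")
```

With 4 drives the two layouts tie; the gap only gets large as the drive count grows.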

> Dedup isn't worth it

To add to that, ZFS dedup is a lie and you should forget its existence unless you have a very specific scenario of being a SAN with a massive amount of RAM, and even then, you had better be damn sure.

I really wish ZFS had either an option to store the dedup table on an NVMe device like Optane, or a way to do an offline deduplication job.

It does have the former, these days - the "allocation_classes" feature lets you make the permanent home of certain subsets of data on "special" vdevs - which includes methods of specifying "store dedup table there".

Now, that becomes the only place entries on it are stored, so you best make it redundant if you don't want to lose your pool from a single NVMe failing, but the feature is there.

The latter I would predict seeing approximately when the sun burns out, on ZFS. It _really_ doesn't like the idea of data changing locations retroactively.

Thanks for this. I completely missed this feature in the run up to 0.8.

I'm going to have to do some test setups with this.

> Enable lz4 compression everywhere

Is the perf penalty low enough now that it just doesn't matter? I've always disabled compression on datasets I know are going to store only high-entropy data, like encoded video, that has a poor compression ratio.

I second the hot spare recommendation many times over. It can save your bacon.

It's generally the other way around actually, aside from storing already highly compressed datasets (e.g. video). The compression from lz4 will get you better effective performance on ZFS, both in throughput and latency, because less IO has to be done. This is because your CPU can usually do lz4 at multiple GB/s, compared to the few hundred MB/s you might get from your spinning rust disks.
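A back-of-the-envelope model of that argument, as a sketch only: the disk and lz4 throughput figures and the compression ratios are placeholder assumptions, not measurements.

```python
# Simple model of why lz4 can speed up reads from slow disks.
# ASSUMPTIONS: the disk and lz4 throughput figures and the compression
# ratios are illustrative placeholders, not measurements.


def effective_read_mb_s(disk_mb_s, ratio, lz4_mb_s):
    """Logical MB/s delivered when reading compressed data.

    The disk delivers disk_mb_s of compressed bytes, which expand to
    disk_mb_s * ratio logical bytes; decompression speed caps the result.
    """
    return min(disk_mb_s * ratio, lz4_mb_s)


if __name__ == "__main__":
    disk = 150.0      # assumed spinning-disk throughput, MB/s
    lz4 = 2000.0      # assumed lz4 decompression throughput, MB/s
    for ratio in (1.0, 1.5, 2.0):
        speed = effective_read_mb_s(disk, ratio, lz4)
        print(f"compression ratio {ratio}: {speed:.0f} MB/s")
```

At a ratio of 1.0 (incompressible data such as encoded video) there is no gain, which matches the exception called out above.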
Neat! Makes sense.
Does rebooting help with soft errors in non-ECC RAM? I would have thought bit flips would be transient in nature, but I'm not really familiar.
Having run ZFS (FreeNAS/TrueNAS) on 2 homemade NAS devices for years and years, I can say it is rock solid, even though I never used ECC RAM due to lack of choices. I would bet there were many soft errors in all these years, but so far I have never had a problem that could not be recovered. The biggest issue ever was destroying the boot USB storage within months, but that was partially solved lately: I moved to fixed drives as the boot drive and later to virtualization for the boot disk and OS, so the problem went away completely.
occasionally a bit flip will corrupt the state of something important and long running, a reboot will obviously clear this

usually it will hit nothing and have no side effects

You really only need that much RAM if you're also going to do live deduplication of large amounts of data. Very few people actually need that; just using compression with lz4 or zstd, depending on your needs, will suffice for just about everyone and perform better.

The ECC argument is probably about a 50/50 kind of thing. You can get away without it and ZFS will do its best to detect and prevent issues, but if the data was flipped before it was given to ZFS then there's nothing anyone can do. You might get some false positives when reading data back if you have some flaky RAM, but as long as you have parity or redundancy on the disks, things should still get read correctly even if a false problem is detected. That might mean you want to run a scrub (essentially ZFS's version of fsck) more often to look for potential issues, but it shouldn't fundamentally be a big deal. If you end up wanting 24/7 highly available storage that won't blip out occasionally, you'll probably really want the ECC RAM, but if you're fine with having to reboot it occasionally or telling it to repair problems that it thinks were there (but weren't, because the disk is fine but the RAM wasn't), then you should be fine.

The extra checksums and data that ZFS can use for all this can make it really robust even on bad hardware. I had a BIOS update cause some massive PCIe bus issues that I didn't realize were going on for a bit, and ZFS kept all my data in good condition even though writes were sometimes just never happening because of ASPM causing issues with my controller card.
Others have said good things (ECC is good by itself and doesn't have much to do with ZFS), and it is actually quite easy to check whether you need much RAM for ZFS. Start a (Linux) VM with a few hundred megabytes of RAM and run ZFS on it. Of course, it will not be as performant as having a lot of RAM. But it will not crash, hang or be unusable in one way or another.

Sources:

- https://www.reddit.com/r/DataHoarder/comments/3s7vrd/so_you_...
- https://www.reddit.com/r/homelab/comments/8s6r2r/what_exactl...
- My own tests with around 8 TB of ZFS data in a Linux VM with 256 MB RAM.

As always, it depends on your use-case.

I have several file servers that all use ZFS exclusively, and 10x that number of servers using ZFS as the system FS.

Rule of thumb that I like: 1GB RAM/TB of storage. This seems to give me the best bang-for-our-buck.

For a small (under 20) number of office users, doing general 'office' stuff, using Samba, it's overkill.

For large media shares with heavy editor access, and heavy strains on the network, it's a minimum.

Depends on what the server is serving.

DeDUP is a different story. The RAM is used to store the frequently accessed data. If you are using DeDUP you fill the motherboard with as much RAM as will fit. NO EXCEPTIONS! This may have been the line of thinking that scared you away from it.

I have a 100TB server that is just used for writing data to and is never read from (sequential file back-ups before it's moved to "long term storage"). It has 8GB of RAM, and is barely touched.

I also have a 20TB server with 2TB of RAM, that keeps the RAM maxed out with DeDUP usage.

ECC: It's insurance, and it's worth it.

That's not precisely why dedup needs gobs of RAM. (If you already know this distinction, I apologize, I just want to make sure people reading this do.)

You effectively (unless you use allocation classes) need to keep the entire DDT in RAM all the time if you don't want any write to a dedup-enabled dataset to potentially require blocking on reading the relevant segment from spinning disks into RAM (thus tanking performance even worse than dedup normally does). It's not really related to the mechanisms in the rest of ZFS for keeping {frequently,recently} used data cached in RAM.

The freenas hardware requirements themselves say "8 GB RAM (ECC recommended but not required)"

https://www.freenas.org/hardware-requirements/

I myself use freenas with 16GB of non-ECC ram.

Of course it is possible to have a bit flip in memory that is then dutifully stored incorrectly by ZFS to disk, but this was a possibility without ZFS as well.

I've actually been waiting for this feature since I first set up my pool. It seemed theoretically possible; we were just waiting for an implementation.

FreeNAS is excellent in many ways. Except that weird gospel their forum people have.

ZFS only needs a lot of RAM if deduplication is enabled. And it shouldn't be for most use cases, or only enabled on one dataset that benefits from it.

Many ZFS installs are fine on 8GB or less.

ECC RAM is better but not required. The idea is to catch memory errors, hence ECC is better.