

by rsync 2341 days ago

"Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial."

I have not read this post-mortem yet, but I can attest that this is a viable strategy.

As many know, rsync.net is built entirely on ZFS.

While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs), we have seen weird corner cases where a 'zfs destroy', or even an ordinary 'rm' of hundreds of millions of files, will suddenly either take forever (years) or halt the (FreeBSD) system.

In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot into a newer FreeBSD version with a newer production release of ZFS, and the operation completed in a timely and graceful manner.

---

What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.

We could gain so much "efficiency" and save a lot of money if we did common-sense things like bridging zpools across multiple JBODs or running larger vdevs, but then we'd find ourselves with fascinating failures instead of boring ones.