by zaroth 2341 days ago
I’m no ZFS expert, but it must have been incredibly stressful, if not mildly terrifying, going that far down the rabbit hole with customer data on the line.

I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.

Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.


"Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial."

I have not read this post-mortem yet, but I can attest that this is a viable strategy.

As many know, rsync.net is built entirely on ZFS.

While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs) what we have seen are weird corner cases where suddenly a 'zfs destroy' or even a common 'rm' deletion of hundreds of millions of files will either take forever (years) or will halt the (FreeBSD) system.

In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot to a newer FreeBSD version with a newer, production, release version of ZFS, and the operation completed in a timely and graceful manner.

---

What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.

We could gain so much "efficiency" and save a lot of money if we did common sense things like bridge zpools across multiple JBODs or run larger vdevs, etc. - but then we'd find ourselves with fascinating failures instead of boring ones.

I don't have hundreds of customers, but I handle hundreds of TiB of data (for a science lab).

Issues occur from time to time, and I can assure you that these times are very stressful. I am grateful to rely on ZFS, because so far I have never lost anyone's data (datasets are often around 10 TiB).

No offsite backups? Backblaze B2 is exceedingly cheap for example.
The animation studio I worked for had almost a petabyte of data. It may be cheap to buy the storage, but transferring the data is costly. It's very easy to saturate an MPLS circuit, and even rsync on a 10Gbit internal connection takes a long while.
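To put a rough number on "a long while", here is a back-of-the-envelope sketch. It assumes a decimal petabyte (10^15 bytes) and a 10 Gbit/s link; the `efficiency` parameter is a made-up knob for protocol overhead and contention, not a measured figure.

```python
def transfer_days(size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link running at the given utilization."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15        # assumption: decimal petabyte
TEN_GBIT = 10 * 10**9    # 10 Gbit/s

print(f"{transfer_days(PETABYTE, TEN_GBIT):.1f} days at line rate")             # 9.3 days
print(f"{transfer_days(PETABYTE, TEN_GBIT, 0.5):.1f} days at 50% utilization")  # 18.5 days
```

So even with nothing else on the wire, a full copy is more than a week of sustained line-rate transfer.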

Really, Gandi should have had backups from day one. If you're hosting data, you should always have backups ready and tested on day one.

rsync went quite fine while transferring data (in the same situation as you describe), once we took care of some important bottlenecks (not running it over SSH, disabling compression on files that don't compress well, disabling full checksums, TCP sockopts, ...)
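A minimal sketch of what such an invocation might look like, built as an argument list in Python. The host, module name, and buffer sizes are hypothetical, and it assumes an rsync daemon is already listening on the receiving side; this illustrates the tuning points listed above, not the exact setup that was used.

```python
def build_rsync_command(src, host, module, send_buf=None, recv_buf=None):
    """Build an rsync argument list for a daemon (non-SSH) transfer.

    -a alone implies neither -z (compression) nor -c (full-file checksums),
    so leaving those flags out is what "disabling" them amounts to here.
    """
    cmd = ["rsync", "-a"]
    sockopts = []
    if send_buf is not None:
        sockopts.append(f"SO_SNDBUF={send_buf}")
    if recv_buf is not None:
        sockopts.append(f"SO_RCVBUF={recv_buf}")
    if sockopts:
        # --sockopts passes custom TCP socket options to rsync
        cmd.append("--sockopts=" + ",".join(sockopts))
    # rsync://host/module/ talks to an rsync daemon directly instead of going over SSH
    cmd += [src, f"rsync://{host}/{module}/"]
    return cmd


# Hypothetical values for illustration only.
print(" ".join(build_rsync_command("/data/project/", "backup.example.org", "projects",
                                   send_buf=4194304, recv_buf=4194304)))
```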

where it might leave you hanging for a long time is before the actual transfer, while it builds and compares the file lists on both the sending and receiving sides, when you have big filesystems (hundreds of millions of files).

if you have a strategy to select beforehand which files to transfer (for example from a DB which tracks what has been created or changed, directly from worker or production input) you have a good head start and can minimize running rsync on complete filesystems -- and rather run it on a selection, which is tiny compared to the complete project(s) most of the time.
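A rough sketch of that selection approach: take the changed paths from wherever they are tracked, write them to a list, and hand the list to rsync via `--files-from` so it never has to walk the whole tree. The paths and destination below are hypothetical, and how the list is produced (DB query, job log, ...) is left out.

```python
import os
import tempfile


def write_file_list(changed_paths, root):
    """Write paths relative to root, NUL-separated, to a temp file; return its path."""
    relative = sorted({os.path.relpath(p, root) for p in changed_paths})
    fd, list_path = tempfile.mkstemp(suffix=".lst")
    with os.fdopen(fd, "wb") as f:
        for rel in relative:
            f.write(os.fsencode(rel) + b"\0")
    return list_path


def build_selective_rsync(list_path, root, dest):
    """rsync only the listed files; --from0 matches the NUL-separated list."""
    return ["rsync", "-a", "--from0", f"--files-from={list_path}", root, dest]


# Hypothetical input: in practice this would come from the tracking DB.
changed = ["/proj/shot01/frame0001.exr", "/proj/shot02/frame0042.exr"]
list_path = write_file_list(changed, "/proj")
print(build_selective_rsync(list_path, "/proj/", "rsync://backup.example.org/projects/"))
```

NUL separators keep filenames with spaces or newlines intact, which matters once the list is generated automatically.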

I’d be very curious if you evaluated the post-MPLS guys like Megaport for connectivity.
Hey, good business idea: backup storage in vans! :)
AWS beat you to it[0]. Not a van, but a 45-foot trailer.

[0] https://aws.amazon.com/snowmobile/

That's impressive!
My lab generates similar-sized data sets and the transfer, more than the at-rest storage, is tough.

If our internet (or Box's datacenter) were slow, we could easily collect data faster than we could send it to our collaborators.

We do have backups, but getting back hundreds of TiB takes a really long time: you want to keep the data where it lives.