by DvdGiessen 1087 days ago
In production on SmartOS (illumos) servers running applications and VMs, on TrueNAS and plain FreeBSD for various storage and backups, and on a few Linux-based workstations. Using mirrors and raidz2 depending on the needs of the machines.

We've successfully survived numerous disk failures (a broken batch of HDDs giving all kinds of small read errors, an SSD that completely failed and disappeared, etc.), and were in most cases able to replace them without a second of downtime. It would have been all cases if not for disks placed in hard-to-reach places; for those it now takes only a few minutes of downtime to physically swap the disk.

Snapshots work perfectly as well. Systems are set up to automatically make snapshots using [1]: on boot, on a timer, and right before potentially dangerous operations such as package manager commands. I've rolled back after botched OS updates without problems; after a reboot the machine was back in its old state. Also rolled back a live system a few times after a broken package update, restoring the filesystem state without any issues. Easily accessing old versions of a file is an added bonus which has been helpful a few times.
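To make the "snapshot right before a dangerous operation" idea concrete, here is a minimal Python sketch. It is not how [1] does it; the dataset name, the label, and the helper names are all hypothetical, and it assumes the `zfs` CLI is on the PATH and the caller has permission to snapshot.

```python
import subprocess
from datetime import datetime, timezone


def snapshot_name(dataset, label, now):
    """Build a name such as 'rpool/ROOT@pre-upgrade-20240102T030405Z'."""
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return f"{dataset}@{label}-{stamp}"


def snapshot_command(dataset, label, now, recursive=True):
    """Return the argv for 'zfs snapshot'; '-r' also snapshots child datasets."""
    cmd = ["zfs", "snapshot"]
    if recursive:
        cmd.append("-r")
    cmd.append(snapshot_name(dataset, label, now))
    return cmd


def run_with_snapshot(dataset, label, risky_cmd):
    """Take a snapshot, then run the risky command and return its exit code.

    If the snapshot fails, check=True raises and the risky command never runs.
    """
    now = datetime.now(timezone.utc)
    subprocess.run(snapshot_command(dataset, label, now), check=True)
    return subprocess.run(risky_cmd, check=False).returncode
```

Something like `run_with_snapshot("rpool/ROOT", "pre-upgrade", ["pacman", "-Syu"])` would be the usage, with the dataset name adjusted to your own layout.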

Send/receive is ideal for backups. We are able to send snapshots between machines, even across different OSes, without issues. We've also moved entire pools from one OS to another without problems.
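For anyone who hasn't used it, send/receive boils down to piping `zfs send` into `zfs receive` on the other machine. A rough Python sketch of that pipeline over SSH follows; the host, dataset, and snapshot names are made up, the helper names are mine, and real setups (including [1]) handle resuming, holds, and error cases that this ignores.

```python
import shlex
import subprocess


def send_command(dataset, new_snap, prev_snap=None):
    """argv for a full send, or an incremental one when prev_snap is given."""
    cmd = ["zfs", "send"]
    if prev_snap is not None:
        # -i sends only the changes between prev_snap and new_snap
        cmd += ["-i", f"{dataset}@{prev_snap}"]
    cmd.append(f"{dataset}@{new_snap}")
    return cmd


def receive_command(host, target):
    """argv that runs 'zfs receive' on the remote host via ssh."""
    # ssh joins its arguments into one remote shell command, so quote the target
    return ["ssh", host, "zfs", "receive", shlex.quote(target)]


def replicate(dataset, new_snap, host, target, prev_snap=None):
    """Pipe 'zfs send' into a remote 'zfs receive'; raise if either side fails."""
    send = subprocess.Popen(send_command(dataset, new_snap, prev_snap),
                            stdout=subprocess.PIPE)
    recv = subprocess.run(receive_command(host, target), stdin=send.stdout)
    send.stdout.close()
    if send.wait() != 0 or recv.returncode != 0:
        raise RuntimeError("zfs send/receive failed")
```

The stream format is what makes the cross-OS part work: the receiving side only needs a ZFS implementation that understands the features used by the sending dataset.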

Knowing we have automatic snapshots and external backups configured also allows me to be very liberal with giving inexperienced people root access to various (non-critical) machines. If anything breaks it will always be easy to roll back, so I can encourage them to learn by experimenting a bit. We can even diff between snapshots to inspect what changed and learn from that.
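The diffing is done with `zfs diff`. As an illustration, here is a small Python sketch that parses its output into something you can filter or print. It assumes the tab-separated form produced by `zfs diff -H` (change type, path, and for renames a second path); the type and function names are hypothetical, and extra columns such as link-count changes are simply ignored.

```python
from typing import List, NamedTuple, Optional

# Change-type characters printed by 'zfs diff'
CHANGE_KINDS = {"+": "added", "-": "removed", "M": "modified", "R": "renamed"}


class Change(NamedTuple):
    kind: str
    path: str
    new_path: Optional[str] = None  # only set for renames


def parse_diff(output: str) -> List[Change]:
    """Parse tab-separated 'zfs diff -H' output into Change records."""
    changes = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip blank or unexpected lines
        kind = CHANGE_KINDS.get(fields[0], fields[0])
        if kind == "renamed" and len(fields) >= 3:
            changes.append(Change(kind, fields[1], fields[2]))
        else:
            changes.append(Change(kind, fields[1]))
    return changes
```

Feeding it the output of something like `zfs diff -H tank/etc@before tank/etc@after` (names invented) gives a list you can walk through together with whoever broke the machine.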

Biggest gotchas so far have been on my personal Arch Linux setup, where the out-of-tree nature of ZFS has caused some issues, like an incompatible kernel being installed, the ZFS module failing to compile, and my workstation subsequently being unable to boot. But even that was solved by my entire system running on ZFS: a single rollback from my bootloader [2] and all was back the way it was before.

Having good tooling set up definitely helped a lot. My monkey brain has the tendency to think "surely I got it right this time, so no need to make a snapshot before trying out X!", especially when experimenting on my own workstation. Automating snapshots using a systemd timer and hooks added to my package manager saved me a number of times.

[1]: https://github.com/psy0rz/zfs_autobackup
[2]: https://zfsbootmenu.org/

1 comment

> Systems are set up to automatically make snapshots

I do that with sqlite to keep a selection of snapshots from the last hours, days etc.

https://github.com/csdvrx/zfs-autosnapshot
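To illustrate what "keep a selection from the last hours, days" can mean, here is a generic retention sketch in Python. It is not the algorithm of the linked tool and does not use sqlite; the function name and the default limits are assumptions. It keeps the newest snapshot in each hour for the last `hourly` hours and the newest in each calendar day for the last `daily` days.

```python
from datetime import datetime, timedelta


def select_keep(snapshots, now, hourly=24, daily=7):
    """Return the set of snapshot names to keep.

    snapshots maps snapshot name -> creation time (datetime).
    Anything not returned would be a candidate for 'zfs destroy'.
    """
    keep = set()
    seen_hours, seen_days = set(), set()
    # Walk newest first so the newest snapshot wins each bucket
    for name, created in sorted(snapshots.items(),
                                key=lambda item: item[1], reverse=True):
        age = now - created
        hour_bucket = created.strftime("%Y%m%d%H")
        day_bucket = created.strftime("%Y%m%d")
        if age <= timedelta(hours=hourly) and hour_bucket not in seen_hours:
            seen_hours.add(hour_bucket)
            keep.add(name)
        if age <= timedelta(days=daily) and day_bucket not in seen_days:
            seen_days.add(day_bucket)
            keep.add(name)
    return keep
```

Weekly or monthly tiers would follow the same pattern with a different bucket key.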