by aborsy 632 days ago
I am not sure this is correct. The consensus seems to be that there are a number of related bugs in ZFS raw send and receive, triggered only by a very specific set of circumstances. In fact, they are so rare that the ZFS developers don’t have enough reports and data to reproduce and fix them. Moreover, those bugs have not led to data loss (someone may correct me if there are confirmed data loss reports among them).

Otherwise, all software has bugs, and you can find them in its bug reports. For example, I use restic and Borg, and both sometimes produce integrity errors. I have repositories in both with integrity errors in them.

1 comment

I've had a few cases of data loss related to ZFS encryption, each causing the total loss of a dataset and all of its ancestor snapshots. The key used by the dataset is simply missing from the keystore, so mounting fails with an I/O error. We have no idea why or how it could happen, but the pool also had a lot of these "innocuous" bugs, while ZFS never reported a single error from the backing disks. This happened on two different full rebuilds of the same pool (from scratch, using zpool-create and manual recreation of all datasets with rsync), on the same hardware and with the same workload. I am 99.999% sure that this is caused by the native encryption code, probably compounded by sending very regular snapshots (not raw, though).

Weirdly, this only happened on a few datasets that were not used a lot; the datasets with lots of I/O have only had the innocuous errors (the ones that refer to deleted files).

I did try debugging some of this with a ZFS developer, but we were not able to recover the data. We did dig deep enough to see that something was very wrong with these datasets (it was not just a bitflip somewhere; rather, the dataset used a key from the keystore that was supposed to exist, but didn't).