Yeah, right now, it's actually not known whether the roots of this bug date back even to the Sun days. And because Oracle ZFS is not open source, we can't know if this or other bugs are lurking in it.
It seems to require certain extremely specific situations to trigger it, like using copy_file_range (the new feature that exposed it), or writing a file and then writing a larger-than-recordsize hole in that file within the same transaction group.
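To make the second trigger condition a bit more concrete, here's a minimal Python sketch (function names and sizes are my own, hypothetical choices) that writes a chunk, skips ahead by more than the default 128 KiB recordsize to leave a hole, and writes again with no fsync in between, so on ZFS both writes would typically land in the same transaction group. It then lists the data extents the filesystem reports via `lseek` with `SEEK_DATA`/`SEEK_HOLE`, a common way for sparse-aware copy tools to find holes. It only sets up the file layout described above; it isn't a known reproducer, and on a non-ZFS or patched system it should just show ordinary holes.

```python
import errno
import os

RECORDSIZE = 128 * 1024  # ZFS default recordsize; a dataset's actual value may differ


def write_then_hole(path, chunk, hole_size):
    """Write `chunk`, skip `hole_size` bytes (leaving a hole), then write `chunk` again.

    There is no fsync between the writes. Returns the expected final file size.
    """
    with open(path, "wb") as f:
        f.write(chunk)
        f.seek(hole_size, os.SEEK_CUR)  # seeking past EOF and then writing creates a hole
        f.write(chunk)
    return 2 * len(chunk) + hole_size


def data_extents(path):
    """Return (start, end) data extents as reported by SEEK_DATA/SEEK_HOLE."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        off = 0
        while off < size:
            try:
                start = os.lseek(fd, off, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # no more data after `off`
                    break
                raise
            end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as a hole
            extents.append((start, end))
            off = end
    finally:
        os.close(fd)
    return extents


if __name__ == "__main__":
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "holey.bin")
    size = write_then_hole(path, b"x" * 4096, 2 * RECORDSIZE)
    print(size, data_extents(path))
```

On a filesystem that doesn't track holes, `data_extents` simply reports one extent covering the whole file.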
Another contributor who is watching this more closely informed me that the issue appears to predate Oracle’s acquisition of Sun. While this is bad, it at least suggests that this bug is very rare.
The code has never been formally verified, so there was always a possibility of such a bug existing. Without formal verification, it is possible that more such bugs will be found. I should add that there is no formally verified, production-ready storage stack, so not using ZFS would not eliminate the risk of hitting bugs like this. :/
Formal verification seems more appropriate for finished software not undergoing development or feature changes. It's the last step before software is set permanently in stone, unchanging, forever.
It's worth noting that copy_file_range is used by a lot of things. Most programming languages' "copy_file" functions use copy_file_range, everything from Rust to Emacs Lisp! The only language I can think of that doesn't use copy_file_range when copying files is Python.
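For anyone curious what those "copy_file" helpers roughly do, here's a minimal sketch of the usual pattern, written in Python using `os.copy_file_range` (Linux-only, Python 3.8+), which exposes the syscall directly. The function name `copy_file`, the chunk sizes, and the set of errno values that trigger a fallback are my own assumptions, not any particular runtime's implementation: loop on copy_file_range until it returns 0 (EOF), and fall back to a plain read/write loop if the kernel or filesystem refuses.

```python
import errno
import os

# Assumed set of errors meaning "copy_file_range can't be used here, fall back".
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EOPNOTSUPP, errno.EINVAL}


def copy_file(src_path, dst_path, max_chunk=1 << 30, buf_size=1 << 20):
    """Copy src_path to dst_path, preferring copy_file_range(2)."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            use_cfr = hasattr(os, "copy_file_range")  # absent on non-Linux builds
            while True:
                if use_cfr:
                    try:
                        # No explicit offsets: both fds' file positions are used and advanced.
                        n = os.copy_file_range(src, dst, max_chunk)
                    except OSError as e:
                        if e.errno in _FALLBACK_ERRNOS:
                            use_cfr = False  # retry from the current positions with read/write
                            continue
                        raise
                    if n == 0:  # source is at EOF
                        break
                else:
                    buf = os.read(src, buf_size)
                    if not buf:
                        break
                    view = memoryview(buf)
                    while view:  # os.write may write only part of the buffer
                        written = os.write(dst, view)
                        view = view[written:]
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

Because the kernel is free to implement copy_file_range as a reflink/clone on filesystems that support it, code like this ends up exercising the filesystem's cloning path without the caller ever asking for a clone explicitly.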
On Gentoo, the Portage package manager is written in Python but has some "native extensions"; one of them copies files with copy_file_range, which, as I understand it, is used when merging images into the root filesystem.
Also, GNU coreutils' "cp" command uses it by default in recent releases; I'm not sure which release specifically introduced this change.
There are other things required to trigger the bug that are a lot less common though.
> It's worth noting that copy_file_range is used by a lot of things.
Yes, but the trigger feature, block cloning, only landed in the latest 2.2 release. If you immediately hopped on 2.2 and used a system with lots of copy_file_range and FICLONE use, then yes, you may have a problem (as you note, on Gentoo, where this problem surfaced).
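Since the exposure depends on running 2.2 or later, a quick check is to compare the loaded module's version against 2.2. Here's a minimal Python sketch; the helper names are hypothetical, and reading `/sys/module/zfs/version` assumes a Linux system with the ZFS module loaded. A new enough version alone doesn't mean cloning is in use: the pool also needs the `block_cloning` feature enabled (e.g. check `zpool get feature@block_cloning <pool>`).

```python
import re
from pathlib import Path

MIN_BCLONE = (2, 2)  # block cloning first shipped in the OpenZFS 2.2 release


def parse_zfs_version(text):
    """Extract (major, minor, patch) from strings like '2.2.0-1' or 'zfs-2.1.5-1'."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"unrecognized ZFS version string: {text!r}")
    return tuple(int(g) for g in m.groups())


def block_cloning_possible(version_text):
    """True if this version is new enough to include block cloning at all."""
    return parse_zfs_version(version_text)[:2] >= MIN_BCLONE


if __name__ == "__main__":
    # Assumed location of the loaded kernel module's version string on Linux.
    p = Path("/sys/module/zfs/version")
    if p.exists():
        v = p.read_text().strip()
        print(v, "-> block cloning possible:", block_cloning_possible(v))
    else:
        print("ZFS kernel module not loaded (or not Linux)")
```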
Most people were just hopping on the bandwagon. My distro ships 2.1.5, so I have a six-month wait until this feature lands; I was just building copy_file_range support into my ZFS apps right before news of this bug hit.[0]
> There are other things required to trigger the bug that are a lot less common though.
Exactly. My guess is that the incidence of this will be exceedingly rare for the common user, small NAS user, etc. I've run a corruption detector[1], and what I've found mostly indicates false positives. Fingers crossed, but so far I've seen no actual positive matches on a system with probably a little less than 1 million files.
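As a rough illustration of why a scan like this throws false positives, here's a minimal Python sketch of one possible heuristic. It is not the detector linked at [1]; the 4 KiB block size and the function names are assumptions. It presumes the damage shows up as runs of zero bytes in copied files and flags any file containing an all-zero block. Plenty of healthy files (sparse VM images, databases, preallocated files) contain zeroed blocks too, so every hit still has to be checked against a known-good copy or checksum.

```python
import os

BLOCK = 4096  # assumed scan granularity, not taken from any real detector


def zero_blocks(path, block=BLOCK):
    """Return offsets of block-aligned chunks made entirely of zero bytes.

    The final partial block is checked too.
    """
    hits = []
    zero = bytes(block)
    with open(path, "rb") as f:
        off = 0
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            if chunk == zero[: len(chunk)]:
                hits.append(off)
            off += len(chunk)
    return hits


def scan_tree(root, block=BLOCK):
    """Yield (path, zero_block_count) for regular files containing any all-zero block."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files
            try:
                hits = zero_blocks(path, block)
            except OSError:
                continue  # unreadable file; skip it
            if hits:
                yield path, len(hits)
```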