In about 4 years of running it on a couple of servers and countless virtuals/desktops, I've never had a reliability issue that was directly related to btrfs. My servers are not plugged in to UPSes, so I get the occasional "shutdown due to power loss". The only time I've lost data was due to a cable disconnection in my hardware RAID array[0], and even then I was able to recover a substantial amount of the files stored on `btrfs`.
[0] Well, not filesystem-provided RAID; I have LSI controllers that provide the array to the OS as a single disk.
As best I can tell, reports of data loss on btrfs are all from the early 2010s; after about 2014 or so I can't find anyone who claims to have lost data due to a btrfs bug on an up-to-date system.
On Btrfs, in case of bad parity being used to reconstruct a stripe, the resulting bad reconstruction is still subject to data checksumming, and will EIO. Corrupt data won't be sent to user space.
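To make that concrete, here is a toy sketch of "reconstruct from parity, then verify the data checksum". It is not btrfs code: the function names are made up for illustration, the stripe is plain single-parity XOR, and `zlib.crc32` stands in for whatever checksum the filesystem actually uses.

```python
import errno
import zlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_reconstruction(surviving, parity, expected_csum):
    """Rebuild the one missing data block from parity, then check its checksum.

    Hypothetical helper: a bad parity block yields a bad rebuild, which the
    checksum catches, so the caller gets EIO instead of corrupt data.
    """
    rebuilt = xor_blocks(surviving + [parity])
    if zlib.crc32(rebuilt) != expected_csum:
        raise OSError(errno.EIO, "checksum mismatch after reconstruction")
    return rebuilt


# Three data blocks and their parity; d1 is the block we pretend to have lost.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
good_parity = xor_blocks([d0, d1, d2])
csum_d1 = zlib.crc32(d1)  # checksum recorded when d1 was written
```

With good parity the lost block comes back intact; with stale or corrupt parity the call raises `OSError` with `errno.EIO` rather than returning garbage.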
I think in Linux, if you're using mdadm there is the ability to specify a write journal; all data (i.e. blocks+parity) gets written to the journal first, and then gets cleaned up after everything gets completed successfully, and the journal is replayed after a power failure.
Mind you, for that to work well you'd want a victim SSD with a write speed at least that of the array...
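Roughly, the journal idea looks like the toy model below. This is only a sketch of the log-first, replay-after-crash pattern, not how mdadm implements it; the class, its fields, and the `crash_after_data` switch are all invented for illustration.

```python
class JournaledArray:
    """Toy model of a parity array fronted by a write journal."""

    def __init__(self):
        self.journal = []  # stands in for the journal device (e.g. an SSD)
        self.data = {}     # stripe number -> data blocks on the array
        self.parity = {}   # stripe number -> parity block on the array

    def write_stripe(self, stripe_no, data, parity, crash_after_data=False):
        entry = (stripe_no, data, parity)
        self.journal.append(entry)       # 1. data + parity go to the journal first
        self.data[stripe_no] = data      # 2. then the data lands on the array
        if crash_after_data:
            return                       # simulated power loss: parity is stale
        self.parity[stripe_no] = parity  # 3. then the parity
        self.journal.remove(entry)       # 4. only now is the entry retired

    def replay(self):
        """After a power failure, redo every write still in the journal."""
        for stripe_no, data, parity in self.journal:
            self.data[stripe_no] = data
            self.parity[stripe_no] = parity
        self.journal.clear()
```

A crash between steps 2 and 3 leaves new data next to old parity (the write hole); calling `replay()` rewrites both from the journal, so the stripe is consistent again.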
Hardware RAID can indeed suffer from this too, but does ZFS suffer from it as well? With exactly the same impact? AFAIK the filesystem stays consistent on ZFS.
> Rather than the stripe width be statically set at creation, the stripe width is dynamic. Every block transactionally flushed to disk is its own stripe width. Every RAIDZ write is a full stripe write. Further, the parity bit is flushed with the stripe simultaneously, completely eliminating the RAID-5 write hole. So, in the event of a power failure, you either have the latest flush of data, or you don't. But, your disks will not be inconsistent.
> There's a catch however. With standardized parity-based RAID, the logic is as simple as "every disk XORs to zero". With dynamic variable stripe width, such as RAIDZ, this doesn't work. Instead, we must pull up the ZFS metadata to determine RAIDZ geometry on every read. If you're paying attention, you'll notice the impossibility of such if the filesystem and the RAID are separate products; your RAID card knows nothing of your filesystem, and vice-versa. This is what makes ZFS win.
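For anyone wondering what "every disk XORs to zero" means in practice, here is a minimal sketch for the fixed-width, single-parity case only. The helper names are made up, and it says nothing about how RAIDZ itself lays out or verifies stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks):
    """True if data blocks plus parity XOR to all zeros.

    The check needs no filesystem metadata, only the fixed stripe geometry.
    """
    return not any(xor_blocks(blocks))


data = [b"\x0f\xf0", b"\x33\x33", b"\xaa\x55"]
parity = xor_blocks(data)

good_stripe = data + [parity]
# Simulate a torn write: one data block changed, parity left untouched.
torn_stripe = [b"\xff\xff"] + data[1:] + [parity]
```

`stripe_is_consistent(good_stripe)` holds and `stripe_is_consistent(torn_stripe)` does not. With variable stripe width there is no fixed set of blocks to XOR, which is why the quoted text says the geometry has to come from the filesystem metadata.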
It's the default for new Synology devices, and has been for a while. I suspect others are using it in a similar situation for home-grade NAS and up into the prosumer end of the market.
I feel like Btrfs is probably going to be well tested here, but I wonder how many of these users are diagnosing Btrfs problems when they occur? It's going to be more evident to some people, and you have to assume that some of the vendors are competent, but this is against a backdrop of people throwing this kit away or starting from scratch versus performing a root cause analysis.
I've personally been running this since it was stable on my DS1515+. I haven't had filesystem issues yet, but I make sure my important stuff is backed up elsewhere. A local backup like this is convenient for faster recovery in a lot of situations though which is why I keep it. I've SSH'd to the device and played around a little, but I fear I'd hit something proprietary, if the worst recovery situation occurred and I had to get everything from the DS1515+. If it was just an Ubuntu box I wouldn't have those fears, but the Syno NAS package is compelling.
My understanding is that most bugs are ironed out of btrfs itself, but the tooling is still weak. For example, if a disk drive goes bad on you and you manage to recover about half of the sectors with a disk imaging tool, you won't be able to extract files from the image without extreme effort.
Why hasn't this caught up? Is it the case that data recovery companies are hoarding this after investing in their own tools, or something fundamental to the community?
Reliability is not only about data loss. A filesystem may never lose data and still be unreliable if it crashes every few hours, locks up the system, or requires constant monitoring and maintenance.