Hacker News
by e145bc455f1 824 days ago
Just last week my btrfs filesystem got irrecoverably corrupted. This is like the fourth time it has happened to me in the last 10 years. Do not use it on consumer-grade hardware. Compared to this, ext4 is rock solid. It was even able to survive me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.
8 comments

> It was even able to survive me accidentally passing the currently running host's hard disk to a VM guest, which booted from it.

I have also done this, and was also happy that the only corruption was to a handful of unimportant log files. Part of a robust filesystem is that when the user does something stupid, the blast radius is small.

Other less-smart filesystems could easily have said "root of btree version mismatch, deleting bad btree node, deleting a bunch of now unused btree nodes, your filesystem is now empty, have a nice day".

I have had the same btrfs filesystem in use for 15+ years, with 6 disks of various sizes. All hardware components have changed at least once during the filesystem's lifetime.

The worst corruption was when one DIMM started corrupting data. As a result the computer kept crashing, and eventually the filesystem refused to mount because of btrfs checksum mismatches.

The fix was to buy new hardware, then run btrfs filesystem repairs. These failed at some point, but at least got the filesystem running as long as I did not touch the most corrupted locations. Luckily it was RAID1, so most checksums had a correct value on another disk. Unfortunately the checksum tree was corrupted in two locations on both copies. I had to open the raw disks with a hex editor and change the offending byte to the correct value, after which the filesystem has been running again smoothly for 5 years.

To find the location to modify on the disks, I built a custom kernel that printed the expected value and absolute disk position when it detected the specific corruption. I also had to ask a friend to double-check my changes, since I did not have any backups.

> running again smoothly for 5 years

So did you bite the bullet and get ECC, or are you just waiting for the next corruption caused by memory errors? :)

> last week my btrfs filesystem got irrecoverably corrupted.

This is 2 bugs really. 1, the file system got corrupted. 2, tooling didn't exist to automatically scan through the disk data structures and recover as much of your drive as possible from whatever fragments of metadata and data were left.

For 2, it should happen by default. Most users don't want a 'disk is corrupt, refusing to mount' error. Most users want any errors to auto-correct if possible and get on with their day. Keep a recovery logfile with all the info needed to reverse any repairs for that small percentage of users who want to use a hex editor to dive into data corruption by hand.
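
A rough sketch of what such a reversible repair log could look like, in Python against a plain disk image file. Everything here is hypothetical and for illustration only: `apply_repair`, `undo_repairs`, and the JSON-lines log format are not part of any btrfs tool, and the log is assumed to live on a different, healthy filesystem.

```python
import json
import os


def apply_repair(image_path, log_path, offset, new_bytes):
    """Overwrite bytes at `offset`, recording the old bytes in an undo log first."""
    with open(image_path, "r+b") as img:
        img.seek(offset)
        old_bytes = img.read(len(new_bytes))
        if len(old_bytes) != len(new_bytes):
            raise ValueError("repair would extend past the end of the image")
        # Persist the undo record before touching the image, so a crash
        # mid-repair never leaves a change that cannot be reversed.
        with open(log_path, "a") as log:
            log.write(json.dumps({"offset": offset, "old": old_bytes.hex()}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        img.seek(offset)
        img.write(new_bytes)


def undo_repairs(image_path, log_path):
    """Replay the undo log newest-first to restore the pre-repair bytes."""
    with open(log_path) as log:
        records = [json.loads(line) for line in log if line.strip()]
    with open(image_path, "r+b") as img:
        for rec in reversed(records):
            img.seek(rec["offset"])
            img.write(bytes.fromhex(rec["old"]))
```

Undoing newest-first matters when two repairs overlap: the later repair recorded bytes that the earlier one had already changed, so it has to be rolled back first.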

Yeah the last time I had a btrfs volume die, there were a few troubleshooting/recovery steps on the wiki which I dutifully followed. Complete failure, no data recoverable. The last step was "I dunno, go ask someone on IRC." Great.

It's understandable that corruption can happen due to bugs or hardware failure or user insanity, but my experience was that the recovery tools are useless, and that's a big problem.

Writing to a corrupted filesystem by default is bad design. The corruption could be caused by a hardware problem that is exacerbated by further writes, leading to additional data loss.

Where is that log file supposed to be stored? It can't be on the same filesystem it was created for or it negates the purpose of its creation.

If I were designing it, the recovery process would:

* scan through the whole disk and, for every sector, decide if it is "definitely free space (part of the free space table, not referenced by any metadata)", "definitely metadata/file data", or "unknown/unsure (i.e. perhaps referenced by some dangling metadata/an old version of some tree nodes)".

* I would then make a new file containing a complete image of the whole filesystem pre-repair, but leaving out the 'definitely free space' parts.

* such a file takes nearly zero space, considering btrfs's copy-on-write and sparse-file abilities.

* I would then repair the filesystem to make everything consistent. The pre-repair file would still be available for any tooling wanting to see what the filesystem looked like before it was repaired. You could even loopmount it or try other repair options on it.

* I would probably encourage distros to auto-delete this recovery file if disk space is low or after some time, since otherwise the recovery image will end up pinning user data and using up disk space for years, and users will be unhappy.

The above fails in only one case: Free space on the drive is very low. In that case, I would probably just do the repairs in-RAM and mount the filesystem readonly, and have a link to a wiki page on possible manual repair routes.
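
The classification step above can be sketched on a toy model. This is only an illustration under stated assumptions: blocks are plain integers, and `free_table`, `live_refs` (blocks reachable from the current root), and `dangling_refs` (blocks referenced only by orphaned or old metadata) are hypothetical inputs that a real scanner would have to derive from the on-disk trees.

```python
from enum import Enum


class BlockState(Enum):
    FREE = "definitely free space"
    USED = "definitely metadata/file data"
    UNKNOWN = "unknown/unsure"


def classify_blocks(total_blocks, free_table, live_refs, dangling_refs):
    """Assign every block one of the three states from the list above."""
    states = {}
    for block in range(total_blocks):
        if block in live_refs:
            # Reachable from the current root: keep it, even if the free
            # space table (wrongly) also lists it as free.
            states[block] = BlockState.USED
        elif block in free_table and block not in dangling_refs:
            # Listed as free and not referenced by any metadata at all.
            states[block] = BlockState.FREE
        else:
            states[block] = BlockState.UNKNOWN
    return states


def blocks_to_preserve(states):
    """Blocks the pre-repair image must keep: everything not definitely free."""
    return [b for b, s in sorted(states.items()) if s is not BlockState.FREE]
```

For example, with 6 blocks, `free_table={3, 4, 5}`, `live_refs={0, 1}` and `dangling_refs={4}`, blocks 2 and 4 end up as unknown and the image has to preserve blocks 0, 1, 2 and 4. How many blocks land in the unknown bucket decides how much space the pre-repair image really pins.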

>The above fails in only one case: Free space on the drive is very low.

No. Most of the blocks will be marked as unsure in the first step, because most of them have been used before thanks to CoW.

A heuristic could be written like 'protect the latest version of each node, plus 2 prior versions, but anything older you find, treat it as free space'.
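
A minimal sketch of that heuristic, assuming the scan yields `(node_id, generation, block)` tuples where a higher generation means a newer copy of the same node. The tuple layout and the function name `split_by_retention` are made up for illustration; this is not how any existing recovery tool works.

```python
from collections import defaultdict


def split_by_retention(found_nodes, prior_versions=2):
    """Protect the newest copy of each node plus `prior_versions` older ones.

    Returns (protected_blocks, reclaimable_blocks) as two sets.
    """
    by_node = defaultdict(list)
    for node_id, generation, block in found_nodes:
        by_node[node_id].append((generation, block))

    keep = 1 + prior_versions  # latest version plus the allowed prior ones
    protected, reclaimable = set(), set()
    for versions in by_node.values():
        versions.sort(reverse=True)  # newest generation first
        protected.update(block for _, block in versions[:keep])
        reclaimable.update(block for _, block in versions[keep:])
    return protected, reclaimable
```

With five copies of one node at generations 1 to 5, the copies at generations 3, 4 and 5 stay protected and the two oldest are treated as free space.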

Best send a bug report to the btrfs mailing list at linux-btrfs@vger.kernel.org.

If possible include the last kernel log entries before it corrupted. Include kernel version, drive model and drive firmware version.

Huh. I've been running btrfs on a number of systems for probably 12 years at this point. One array in particular was 12TiB of raw storage used for storing VM images in heavy use. Each disk had ~9 years of spindle-on time before I happened to look closely at the SMART output and realized that they were all ST3000DM001's and promptly swapped them all out. The only issue I've ever run into is running out of metadata chunks and needing to rebalance, and that was just once.

> Compared to this, ext4 is rock solid.

Ext4 is the most reliable file system I have ever used. Just works and has never failed on me, not even once. No idea why btrfs can't match its quality despite over a decade of development.

How do you know it was an issue with the FS and not the actual hardware/disk?

Yeah, that's the fun part of the ext/btrfs corruption posts. If you got repeating corruption on btrfs on the same drive but not on ext, how do you know it's not just a drive failure that ext is not able to notice? What would happen if you tried ext with dm-integrity?
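
To make the point concrete, here is a toy block store in Python that keeps a checksum next to each block. It is a sketch only: `ChecksummedStore` and `flip_bit` are hypothetical names, and `zlib.crc32` is a stand-in, not the checksum btrfs or dm-integrity actually use. Reading with `verify=False` models a storage stack that has no data checksums.

```python
import zlib


class ChecksummedStore:
    """Toy block store that records a CRC32 for every block written."""

    def __init__(self):
        self.blocks = {}
        self.sums = {}

    def write(self, n, data):
        self.blocks[n] = bytearray(data)
        self.sums[n] = zlib.crc32(data)

    def read(self, n, verify=True):
        data = bytes(self.blocks[n])
        # Without verification, whatever is stored is returned as-is.
        if verify and zlib.crc32(data) != self.sums[n]:
            raise IOError(f"checksum mismatch on block {n}")
        return data


def flip_bit(store, n, byte_index, bit):
    """Simulate silent corruption (bad RAM, bad cable, failing disk)."""
    store.blocks[n][byte_index] ^= 1 << bit
```

After `flip_bit`, an unverified read returns the damaged data without any complaint, while a verified read raises an error. Same hardware fault, but only one of the two setups ever reports it.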

A bad DIMM is a thing, even more so on consumer hardware that lacks ECC. I recommend you run memtest.