by xk3 1242 days ago
This is very good advice. I did the same preparation; here is the distribution of files before the degraded state:

    Number of successful reads: 280
    Number of IO errors: 0
    Successful read files size: sum 82648303047 max 4884066696 average 295172511
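
The tool that printed these numbers isn't shown here. A minimal sketch of a comparable read test, assuming GNU `stat` and using a hypothetical helper name `read_test`, could look like the following. It reports only counts and byte totals, not the max and average shown above.

```shell
# read_test DIR: try to read every regular file under DIR and summarize.
# Hypothetical helper for illustration; not the tool used in this thread.
read_test() {
    find "$1" -type f -exec sh -c '
        for f in "$@"; do
            size=$(stat -c %s "$f")            # size from metadata (GNU stat)
            if cat "$f" > /dev/null 2>&1; then # a failed read makes cat exit non-zero
                echo "ok $size"
            else
                echo "err $size"
            fi
        done
    ' sh {} + | awk '
        $1 == "ok"  { ok++;  ok_bytes  += $2 }
        $1 == "err" { bad++; bad_bytes += $2 }
        END {
            printf "Number of successful reads: %d\n", ok
            printf "Number of IO errors: %d\n", bad
            printf "Successful read bytes: %.0f\n", ok_bytes
            printf "IO error bytes: %.0f\n", bad_bytes
        }'
}
```

It would be run as `read_test /mnt/loop` once before and once after removing a device.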
Then I unmounted the fs, deleted disk 2, dropped caches (echo 3 > /proc/sys/vm/drop_caches), and remounted the fs:

    sudo umount /mnt/loop
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    echo 3 | sudo tee /proc/sys/vm/drop_caches

    dmesg --human --nopager --decode --level emerg,alert,crit,err,warn,notice,info
    kern  :info  : [Jan22 13:18] tee (215899): drop_caches: 3
    kern  :info  : [  +3.232287] tee (215931): drop_caches: 3
    kern  :info  : [  +0.775697] tee (215953): drop_caches: 3

    rm d2.img
    sudo mount "$ld1" /mnt/loop
I am surprised that mounting worked without error but I guess the device is still active via losetup. I'm assuming this would be similar to an actual disk failure though, if the device weren't there maybe btrfs will complain and ask to be mounted with the `-o degraded` flag.

There was nothing exciting in dmesg

    kern  :info  : [ +14.363762] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
    kern  :info  : [  +0.000004] BTRFS info (device loop0): using free space tree
Oohh weird...

    Number of successful reads: 280
    Number of IO errors: 0
    Successful read files size: sum 82648303047 max 4884066696 average 295172511

    sudo btrfs scrub status /mnt/loop/
    UUID:             a57027e5-feb8-4f58-9022-f5dc0a5c67ac
    Scrub started:    Sun Jan 22 13:33:49 2023
    Status:           finished
    Duration:         0:00:28
    Total to scrub:   77.25GiB
    Rate:             2.76GiB/s
    Error summary:    no errors found
Okay turns out the deleted file is still connected to the loopback device.

    sudo losetup -d $ld2
    sudo umount /mnt/loop
    echo 3 | sudo tee /proc/sys/vm/drop_caches
Now we get some interesting stuff in dmesg

    sudo mount -o degraded "$ld1" /mnt/loop
    mount: /mnt/loop: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
        dmesg(1) may have more information after failed mount system call.

    kern  :info  : [Jan22 13:37] tee (222135): drop_caches: 3
    kern  :info  : [ +16.362674] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
    kern  :info  : [  +0.000004] BTRFS info (device loop0): using free space tree
    kern  :err   : [  +0.000419] BTRFS error (device loop0): devid 2 uuid 1b352839-f719-499f-b9a7-25ed4d06e2be is missing
    kern  :err   : [  +0.000003] BTRFS error (device loop0): failed to read chunk tree: -2
    kern  :err   : [  +0.000183] BTRFS error (device loop0): open_ctree failed
    kern  :info  : [ +11.713125] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm
    kern  :info  : [  +0.000004] BTRFS info (device loop0): allowing degraded mounts
    kern  :info  : [  +0.000001] BTRFS info (device loop0): using free space tree
    kern  :warn  : [  +0.000167] BTRFS warning (device loop0): devid 2 uuid 1b352839-f719-499f-b9a7-25ed4d06e2be is missing
    kern  :warn  : [  +0.007647] BTRFS warning (device loop0): chunk 2177892352 missing 1 devices, max tolerance is 0 for writable mount
    kern  :warn  : [  +0.000002] BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices
    kern  :err   : [  +0.000155] BTRFS error (device loop0): open_ctree failed
But we can still mount it as read-only

    sudo mount -o ro,degraded "$ld1" /mnt/loop
And the results are

    Number of successful reads: 219
    Number of IO errors: 61
    Successful read files size: sum 21798190683 max 2122064756 average 99535117
    IO error files size:        sum 60850112364 max 4884066696 average 997542825
In this test about 26% of data is still fully readable (21798190683 / (21798190683+60850112364)).

I also tried another variant of the experiment where I did all of the above but ran this command before removing the disk:

    sudo rm /mnt/loop/file  # a 500 MB file that was included in the above tests; I deleted it to give btrfs defrag some room to work
    sudo btrfs fi defrag -v -r -czstd /mnt/loop/
And the results are not much better... in fact they are worse: only about 20% of data is fully readable, lol

    Number of successful reads: 199
    Number of IO errors: 80
    Successful read files size: sum 16695157031 max 2122064756 average 83895261
    IO error files size:        sum 65428858016 max 4884066696 average 817860725

OK, so I have a few comments about your experiments:

> I am surprised that mounting worked without error but I guess the device is still active via losetup.

Exactly. `rm` doesn't actually delete the file contents while the file is still open, it just unlinks it from the filesystem tree. So your loopback-mounted disk is still there and all its contents are still available through /dev/loopX.
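
The same behavior can be reproduced without loop devices or root. This is just a sketch with a throwaway temp file; `unlink_demo` is a made-up name.

```shell
# unlink_demo: show that data stays readable through an open descriptor
# after the file's name has been removed.
unlink_demo() {
    tmp=$(mktemp)
    echo "still here" > "$tmp"
    exec 3< "$tmp"    # keep the file open on descriptor 3
    rm "$tmp"         # removes the name only; the inode lives on
    cat <&3           # the contents are still readable
    exec 3<&-         # closing the last descriptor finally frees the data
}

unlink_demo   # prints "still here"
```

A loop device holds its backing file open in the same way, which is why the disk only really went away after `losetup -d`.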

> I'm assuming this would be similar to an actual disk failure though, if the device weren't there maybe btrfs will complain and ask to be mounted with the `-o degraded` flag.

If the /dev/loopX device wasn't there then it would be similar to a complete disk failure, yes.

> In this test about 26% of data is still fully readable

It's true that only 26% of data is still fully readable if you account only for files that are fully intact. But also note that about 78% of files were still completely intact.
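
Both figures come straight from the counts and byte sums in the post. As a quick check (the `pct` helper is only for illustration):

```shell
# pct PART REST: print PART / (PART + REST) as a percentage with one decimal.
pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

pct 21798190683 60850112364   # readable share by bytes -> 26.4
pct 219 61                    # readable share by file count -> 78.2
```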

This is not clear from your comment, but I'm assuming that you are using 4 devices for the btrfs pool as well?

In this scenario, with such a disk configuration and subsequent disk failure you would expect to lose about 25% of files, while the remaining 75% would be intact (especially if the files are small enough)...

But actually, in reality things can be considerably better or worse, depending on a few factors.

For example:

1. If the free space was fragmented. In such a case, a significant percentage of files might actually be allocated on more than one disk, so you'd lose more files than expected if a single disk fails. Although I can see that on your latter experiment, you've defragged the btrfs filesystem beforehand, so perhaps this is not the main issue.

2. Depending on how btrfs allocates data, if the files are not completely filling all of the disks then they can be heavily skewed towards a subset of the disks.

For example, imagine that each of your disks is 1 TB and your files total less than 1 TB.

In this case, all of your files could be allocated on the first disk only, so losing this disk could lead to losing 100% of your data.

Or for example, if your files are less than 2 TB, they might all be allocated on the first 2 disks only, so losing one of these disks would lead to losing a lot more files than you'd expect if files were evenly distributed across all disks.

But on the other hand, if you'd lose one of the other disks, you might not lose any data whatsoever.

3. Depending on how large files are and how much free space there is on each disk, btrfs might be forced to (or might choose to) span a file across more than 1 disk even on the 'single' profile, even if free space was not fragmented.

4. But of course, more generally, how many files you would lose basically depends on how btrfs allocates disk space across the disks for each file.

These disk space allocation algorithms can be considerably more complex than you'd expect from a naive allocator, mostly for performance reasons.

Unfortunately, I know exactly nothing about how btrfs allocates data, so I can't give you more insight than this, sorry!

> 26% of data is still fully readable -- But also note that about 78% of files were still completely intact.

Do you mean partially intact? I did not count that data. Ah, I think I understand what you're saying: 78% of files are fully readable (by number of files), but most of those are small files, which are stored within btrfs metadata ("inlined" extents).

I computed 26% using the quantity of data (i.e. sum() of file size), which I feel is a more accurate representation of what is readable. Sure, btrfs `single` mode will have many partially readable files--and if that is an ideal failure state then I would recommend it.

I tried the same experiment with raid0 btrfs config and only the inlined extents were fully readable--less than 1MB recovered from 80GB of data.

> you would expect to lose about 25% of files, while the remaining 75% would be intact

That's what I expected from btrfs for the last 9 months, and people online were saying that `single` mode has the same data guarantees as `raid0` mode--which is kind of true but also kind of not, as we can see. It's possible, though not likely, that in a highly fragmented filesystem the spread of data in `single` ends up shaped like `raid0`, in which case you could only easily recover the same amount of data (almost none).

What happens in practice is that btrfs will allocate 1gb blocks one drive at a time but, in a multi-disk setup, it writes file extents to multiple disks at a time. So at the file level there are no guarantees about one file being on one disk. This is why I was only able to read 20~30% of data rather than the 75% you and I both naively expected from btrfs single mode. It's important to note that this 20~30% is not guaranteed--it depends on how file extents are saved across multiple disks, and that is probabilistic, not deterministic.
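
To get a feel for how fast the odds drop, here is a toy model. It assumes, purely for illustration and not as a description of the real btrfs allocator, that each extent of a file lands independently and uniformly on one of n disks. A file with k extents then survives the loss of one disk with probability ((n-1)/n)^k. The `survive_pct` name is made up.

```shell
# survive_pct DISKS EXTENTS: chance (in %) that a file with EXTENTS extents
# has none of them on the one failed disk, under the toy model above.
survive_pct() {
    awk -v n="$1" -v k="$2" 'BEGIN { printf "%.1f\n", 100 * ((n - 1) / n) ^ k }'
}

survive_pct 4 1   # single-extent file -> 75.0
survive_pct 4 5   # five extents       -> 23.7
```

Under this model only single-extent files reach the naive 75%; a handful of extents per file is enough to land in the 20~30% range.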

> the remaining 75% would be intact (especially if the files are small enough)

If all the files are inlined extents (default limit is 2048 bytes per file), and you were using the raid1c4 metadata profile, then theoretically you could have 100% intact even after losing 3 of 4 disks (regardless of what the data profile is set to, since that would not be used to save the file data)--but you would be using 80GB of space allocated as "metadata" in btrfs. (I have not tested that scenario but I think it is likely to be true.) In my test, all of the redundancy for <2kb files came from the raid1c3 metadata configuration I used, while the larger files, like the max() 2GB one, were recovered only by chance: their file extents happened to be saved on one or more of the other three disks.
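
A rough way to see how many files could even qualify for inlining is to count those at or below the 2048-byte limit. This is only a sketch: `count_inline_candidates` is a made-up name, and whether btrfs really inlines a given file also depends on mount options and other conditions, so treat the result as an upper bound.

```shell
# count_inline_candidates DIR: count regular files of at most 2048 bytes.
# (Counts output lines, so file names containing newlines would be miscounted.)
count_inline_candidates() {
    find "$1" -type f -size -2049c -print | wc -l | tr -d ' '
}
```

For example, `count_inline_candidates /mnt/loop` against the test filesystem.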

> Although I can see that on your latter experiment, you've defragged the btrfs filesystem beforehand

Yes, I think btrfs defrag does not do much different from when it writes the files initially, but it is still a useful utility in situations where files were overwritten many times. As I understand it, there are many reasons that btrfs will decide to write a file to multiple extents, and there seems to be no option to have it write one file to one disk as much as possible.

> all of your files could be allocated on the first disk only

Maybe a good example would be if I had filled up a disk and then added a new one. btrfs really tries to allocate data fairly, but adding a new disk is a situation where it would definitely be skewed toward one disk. I was actually thinking of recreating my filesystem and just copying over data one disk at a time so that the file extents would be written more consolidated to each disk--but still there would be no mechanism to prevent cross-disk extent writing...

> But on the other hand, if you'd lose one of the other disks, you might not lose any data whatsoever

yep

> even if free space was not fragmented

That would be ideal, but I think it is pretty unlikely in practice unless the 1gb blocks which btrfs allocates per disk are used immediately and no files are appended to or changed; otherwise there is lots of free space within each 1gb block for btrfs to find.

> Unfortunately, ... I can't give you more insight than this

Your comments were helpful and interesting. Hopefully I could share some of my findings as well. I still like btrfs but it certainly acts like a mad chef who is trying to boil 6,827 pots of water to cook spaghetti in this situation.

Multi-disk with the `single` profile is a bit weird. I'm planning on switching my array to individual btrfs `single` profile disks with `dup` metadata. I will also try MergerFS to group them into one disk, but if MergerFS feels sketchy I'll just interface directly with many disks and balance files between them manually.
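
Roughly what that plan could look like, as a dry run that only prints the commands. The device names, the mount points and the `plan_disks` helper are placeholders, and the mergerfs create policy (`mfs`, most free space) is just one possible choice.

```shell
# plan_disks DEV...: print (not run) the commands for one btrfs filesystem per
# device, with data `single` and metadata `dup`, pooled with mergerfs.
plan_disks() {
    i=1
    branches=
    for dev in "$@"; do
        echo "mkfs.btrfs -d single -m dup $dev"
        echo "mount $dev /mnt/disk$i"
        branches="${branches:+$branches:}/mnt/disk$i"
        i=$((i + 1))
    done
    echo "mergerfs -o category.create=mfs $branches /mnt/pool"
}

plan_disks /dev/sdX /dev/sdY
```

Since mergerfs places each file whole on one branch, losing a disk should only take the files stored on that disk.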