Hacker News
by londons_explore 1500 days ago
All these "clever" filesystems can never guarantee not to run out of space for their own metadata. That's because even to delete a file they might need more space in the journal, or to un-copy-on-write some metadata.

The mistake, however, is that even though it isn't practical to make theoretical guarantees that the filesystem won't end up full and broken, it is very possible to make such a thing happen only in exceedingly unlikely cases. One runaway dd isn't that...

>it is very possible to make such a thing only happen in exceeding unlikely cases. One runaway dd isn't that...

It's not dd, it's one process run by root that fills the filesystem with one big file. That's like the first thing I would test to see if it can destroy my filesystem.

It's really the filesystem's responsibility. If it needs to reserve 30%, so be it; if it needs more because I wrote billions of files, so be it (even if it says "sorry, I told you I had 50GB free, but because you wrote so many small files it's now just 45GB"; after all, it can only make an estimate). But it's the filesystem's job to tell me roughly how much free space I have, and to stop writing before it really, internally, can't take any more. And NOT to kill itself because I allocated 100% of it; there is just no excuse. That's the filesystem's responsibility.

PS: The clever ZFS survives that "unlikely" test easily.

Why can't they? For example, Btrfs reserves some storage for its internal use, which should be more than enough to update the journal to fix a full filesystem.
Calculating exactly how much you need to reserve for the worst case is a near-impossible task.

For example, say you try to delete a file, which is part of one of multiple identical snapshots, so deleting the file doesn't free up any space, but does require extra metadata to be written (since a new directory entry will be needed that shows the file is deleted in this snapshot only).

The same operation could be done for millions of files, eating up all the reserved space. End result: full disk and unusable filesystem, even for deletes.

The alternative is not to allow file deletes to use reserved space. But now when you have a full disk, some things become 'undeletable', since the only way to free space is to delete all copies of the file, but it isn't permitted to delete any one copy of the file since the intermediate state would use more disk space.
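
To make the snapshot scenario concrete, here is a toy model in Python. It is only a sketch under made-up assumptions: every delete costs exactly one unit of metadata, and a file's data is freed only when the last snapshot drops it. `ToyCowFs` and `delete_everything_from` are hypothetical names and have nothing to do with how Btrfs or ZFS actually account for space.

```python
import errno


class ToyCowFs:
    """Toy copy-on-write model (an assumption for illustration, not real Btrfs/ZFS)."""

    def __init__(self, free_units):
        self.free = free_units   # space left, including whatever was "reserved"
        self.refs = {}           # file name -> (size, set of snapshots holding it)

    def add_file(self, name, size, snapshots):
        self.refs[name] = (size, set(snapshots))

    def delete(self, name, snapshot):
        # Every delete must first write one unit of new metadata.
        if self.free < 1:
            raise OSError(errno.ENOSPC, "no space left for delete metadata")
        self.free -= 1
        size, holders = self.refs[name]
        holders.discard(snapshot)
        if not holders:          # last reference gone: the data is finally freed
            del self.refs[name]
            self.free += size


def delete_everything_from(fs, snapshot):
    """Delete every file from one snapshot; return how many deletes succeeded."""
    done = 0
    for name in list(fs.refs):
        try:
            fs.delete(name, snapshot)
        except OSError:
            break
        done += 1
    return done
```

With 3 free units and 5 files shared by snapshots "a" and "b", deleting everything from "a" succeeds 3 times, frees nothing, and leaves 0 units. After that, even deleting the other copy of a file (which would free its data) fails with ENOSPC in this model.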

What is supposed to happen is that the metadata commit fails due to ENOSPC before the super block update. Thus the current super points to the current valid working tree roots, not the partial/failed tree roots.

Btrfs won't issue the writes for super block update until the device says the current metadata transaction is successfully on stable media.
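
A rough sketch of that ordering, as a toy Python model rather than anything resembling the Btrfs code: all new tree blocks are written first, and the super block pointer moves only if every write succeeded. `ToyDisk` and `commit` are hypothetical names, and the "last block written is the root" convention is an assumption made for brevity.

```python
import errno


class ToyDisk:
    """Toy block device with a single super block pointer (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}          # block number -> contents
        self.super_root = None    # block number of the last committed tree root
        self.next_block = 0

    def alloc_write(self, contents):
        if len(self.blocks) >= self.capacity:
            raise OSError(errno.ENOSPC, "no space for new metadata block")
        n = self.next_block
        self.next_block += 1
        self.blocks[n] = contents
        return n


def commit(disk, new_tree_blocks):
    """Write all new blocks; repoint the super block only if every write succeeded.

    Assumes new_tree_blocks is non-empty and that its last entry is the new root.
    """
    written = []
    try:
        for contents in new_tree_blocks:
            written.append(disk.alloc_write(contents))
    except OSError:
        for n in written:         # throw away the partial tree
            del disk.blocks[n]
        raise                     # super_root still names the old, valid root
    # A real filesystem would flush here and wait for the device to confirm.
    disk.super_root = written[-1]
```

If a commit runs out of space halfway through, `super_root` is untouched and the old tree is still fully present, which is the "fails before the super block update" behaviour described above.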

It is possible the filesystem is completely consistent (can be mounted, btrfsck finds no error), and yet not bootable due to the interruption of updates. Software updates are one transaction in user space but not atomic unless expressly designed for it. From the fs point of view, a software update might be broken up into dozens of fs transactions.

It's also possible the device lies about writes being on stable media. If the fs writes some metadata, does flush/FUA, then super block write, and flush/FUA, the device should only write the super block after the prior write is on stable media. If it says the first flush succeeded but that write is still happening, and the super block write goes to stable media before all the metadata writes get to stable media and there's a crash or power failure, then you can in fact have a broken filesystem. The super points to tree roots that don't exist. This is definitely a device flaw, not an fs flaw.
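
The same "payload first, flush, then pointer, flush" ordering can be sketched in user space with ordinary files. This is only an analogy, not what the kernel does: the `tree.N` and `super` file names and both function names are hypothetical, and it relies on `os.fsync` actually reaching stable media, which is exactly the promise a lying device breaks.

```python
import os
import tempfile


def write_then_publish(directory, payload: bytes, generation: int) -> str:
    """Write a payload, flush it, and only then update the 'super' pointer file."""
    data_name = f"tree.{generation}"
    data_path = os.path.join(directory, data_name)
    with open(data_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())      # barrier 1: payload must be durable first

    tmp_path = os.path.join(directory, "super.tmp")
    with open(tmp_path, "wb") as f:
        f.write(data_name.encode())
        f.flush()
        os.fsync(f.fileno())      # barrier 2: pointer contents durable
    # Atomic switch of the pointer. A careful version would also fsync
    # the directory itself; that step is omitted here for brevity.
    os.replace(tmp_path, os.path.join(directory, "super"))
    return data_path


def read_current(directory) -> bytes:
    """Follow the 'super' pointer to the payload it names."""
    with open(os.path.join(directory, "super"), "rb") as f:
        name = f.read().decode()
    with open(os.path.join(directory, name), "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_then_publish(d, b"generation 1", 1)
        write_then_publish(d, b"generation 2", 2)
        print(read_current(d))
```

If barrier 1 returned early while the payload was still in a volatile cache, a crash could leave `super` naming a `tree.N` file with missing contents, which is the user-space version of a super pointing at tree roots that don't exist.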

Btrfs super blocks contain 3 backup roots. So it's possible to revert to an older and hopefully correct metadata generation (seconds to a couple minutes ago). But this has limited recovery potential. It's also completely thwarted right now if you use any discard mount option on an SSD because discard will ask the device to garbage collect recently freed metadata blocks. So the backup root trees pointed to by the super may already be zeros when they're needed.

But any need for backup roots already means some kind of device (firmware) flaw.
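
For what the backup-root fallback amounts to, here is a minimal Python sketch. It assumes a toy block format of a 4-byte CRC32 followed by the payload, which is not the Btrfs on-disk format; `make_block`, `is_valid` and `pick_root` are hypothetical helpers.

```python
import zlib


def make_block(payload: bytes) -> bytes:
    """Toy block format (an assumption): 4-byte little-endian CRC32, then payload."""
    return zlib.crc32(payload).to_bytes(4, "little") + payload


def is_valid(block: bytes) -> bool:
    """A block is valid if it is not all zeros and its stored checksum matches."""
    return (
        len(block) > 4
        and any(block)
        and int.from_bytes(block[:4], "little") == zlib.crc32(block[4:])
    )


def pick_root(disk: dict, current: int, backups: list):
    """Return the first root address (current, then backups in order) that validates."""
    for addr in [current, *backups]:
        if is_valid(disk.get(addr, b"")):
            return addr
    return None   # nothing usable: every candidate is missing or zeroed
```

If the current root never reached the disk but the newest backup did, `pick_root` returns that backup. If discard has already zeroed all the backup blocks, it returns `None`, which is the "completely thwarted" case.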