by avianlyric 911 days ago
> Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

Aren’t all modern file systems implemented as a tree of discontinuous regions? That’s the whole reason block allocators exist, why file fragmentation is a thing (and defragmentation processes).

How could you reasonably expect to implement a filesystem that under the hood only operates with continuous blocks of disk space? It would require the filesystem to have prior knowledge of the size of all the files that are going to be written, so it can pre-allocate the continuous sections. Otherwise, the second that writing a file caused it to exceed the length of the continuous empty section of disk, future writes would have to pause until the filesystem had finished copying the entire file to a new region with more space.
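To make the alternative concrete, here's a toy sketch of a file stored as a list of discontinuous extents. Everything in it (`Extent`, `ExtentMap`, the offsets) is made up for illustration and isn't how any particular filesystem lays things out. The point is only that growing a file means appending another extent wherever free space happens to be, with no copying of existing data.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Extent:
    file_off: int  # logical offset of this run within the file
    disk_off: int  # where the run starts on disk
    length: int    # run length in bytes


class ExtentMap:
    """Hypothetical model: a file as a sorted list of discontinuous extents."""

    def __init__(self):
        self.extents = []

    def size(self):
        if not self.extents:
            return 0
        last = self.extents[-1]
        return last.file_off + last.length

    def append(self, disk_off, length):
        # Growing the file never moves old data; it just records a new run.
        self.extents.append(Extent(self.size(), disk_off, length))

    def lookup(self, file_off):
        """Translate a logical file offset into a disk offset."""
        starts = [e.file_off for e in self.extents]
        i = bisect_right(starts, file_off) - 1
        if i < 0:
            raise ValueError("offset outside file")
        e = self.extents[i]
        if file_off >= e.file_off + e.length:
            raise ValueError("offset outside file")
        return e.disk_off + (file_off - e.file_off)
```

A file whose first 100 bytes live at disk offset 1000 and whose next 50 bytes live at 5000 is just `append(1000, 100)` followed by `append(5000, 50)`; `lookup()` hides the jump from the caller.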

With ZFS, its heavy dependence on tree structures of discontinuous address regions is what enables all of its desirable features. To say the complexity is needless is to implicitly say ZFS itself is pointless.

2 comments

The issue is that pretty much all other filesystems, at least on Linux, are effectively implemented as swap filesystem drivers with some hierarchical structure on top, because that's the interface pushed by Linux at the kernel level.

In userland, we tend to think of streams of bytes, as provided by the original Unix and as all the docs teach us to treat them: read() and write() are the primitives, and they do byte-aligned reads and writes.

Except the actual Linux VFS has, as its core primitive, the mmap() + pagein/pageout mechanism, with read() and write() being simulated over the pagecache, which treats files as mmap()ed memory regions. It's how IO caching is done on Linux, and it's a source of various issues for ZFS and for people using different architectures, because for a long time (changed quite recently, afaik) the Linux VFS only supported page-sized or smaller filesystem blocks. Which is a bit of a problem if you're a filesystem like ZFS, where the file block can go from 512B to 4MB (or more) in the same dataset, or VMFS, which uses 1MB blocks.
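You can see the shared pagecache from userland with a small experiment. This is only a sketch, and it assumes Linux behaviour: POSIX doesn't strictly promise that a store through a shared mapping is visible to read() without an msync(), but on Linux both paths go through the same cached pages. The function name is made up.

```python
import mmap
import os
import tempfile


def write_via_mmap_read_via_pread(path):
    """Store bytes through a shared mapping, then fetch them with pread()."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; on Unix the default is a shared,
        # writable mapping.
        with mmap.mmap(f.fileno(), 0) as m:
            m[0:5] = b"HELLO"
            # No flush/msync here on purpose: on Linux the read path below
            # is served from the same cached pages the mapping just dirtied.
            return os.pread(f.fileno(), 5, 0)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
    try:
        print(write_via_mmap_read_via_pread(tmp.name))
    finally:
        os.unlink(tmp.name)
```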

What has any of that got to do with the bug described in the article? Presumably every filesystem is responsible for tracking the content of sparse files, and where the holes are. That's not something the Linux kernel is going to give you for free: the FS needs to tell the kernel which pages should be mapped to block addresses on disk and which pages should be simulated as continuous blocks of zeros with no on-disk representation.
It's related to the talk about filesystem interface metaphors in this specific subthread :)
That's true of a storage backend, but not the metaphor presented. Again, the analogy would be a heap: heaps are discontiguous internally too, but you don't demand that users of malloc() understand that there can be a hole in the middle of their memory! Again, the bug here was (seems to have been, it's subtle) a glitch in the tracking of holes in files that didn't ever need to have been there in the first place.
But ZFS doesn't demand that users be aware of holes in files. You can just call `seek()` and `read()` to anywhere, and ZFS will transparently provide zeros to fill the holes. Linux also allows software to become "hole-aware" using `lseek()` with `SEEK_HOLE`/`SEEK_DATA`, but that's an optimisation software can opt into or equally just ignore.
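Here's roughly what the two styles look like from userland, as a sketch. The helper names are mine. Whether the file actually ends up sparse depends on the filesystem; on one without hole support, `SEEK_DATA`/`SEEK_HOLE` just report the whole file as a single data region.

```python
import errno
import os


def make_sparse(path, hole_size=1 << 20):
    """Write 4 bytes, skip ahead, write 4 more; the gap may become a hole."""
    with open(path, "wb") as f:
        f.write(b"head")
        f.seek(hole_size)
        f.write(b"tail")


def read_ignoring_holes(path, offset, n):
    """Hole-oblivious software: seek anywhere and read; holes come back as zeros."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(n)


def data_extents(path):
    """Hole-aware software: list the (start, end) regions that contain data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        pos = 0
        while pos < end:
            try:
                start = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                break  # ENXIO: nothing but a hole from pos to end of file
            stop = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, stop))
            pos = stop
    finally:
        os.close(fd)
    return extents
```

A tool like `cp` can use something like `data_extents()` to skip copying the zeros, but `read_ignoring_holes()` gives the same bytes either way.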

The glitch in this case was a failure to correctly track dirty pages that have yet to be written to disk, and thus reading the on-disk data rather than the in-memory data within the dirty page. It just so happens this issue only appears in the code that's responsible for responding to queries about holes from software that's explicitly asking to know about the holes. ZFS itself never had any issues keeping track of the holes. The bookkeeping always converged on the correct state; it's just that during that convergence it was momentarily possible to be given old metadata about holes (i.e. what's currently on disk), rather than the current metadata about holes (i.e. what's currently only in-memory, and about to be written to disk).
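A toy model of that class of bug, to show the shape of it. This is not ZFS code and not the actual patch; `ToyFile` and both `is_hole_*` methods are invented for illustration. Reads always consult the dirty in-memory state first, while the buggy hole query only looks at what has already reached disk.

```python
class ToyFile:
    """Hypothetical file model with on-disk blocks plus not-yet-written dirty blocks."""

    def __init__(self, nblocks):
        self.on_disk = [None] * nblocks  # None means "hole, nothing allocated"
        self.dirty = {}                  # block index -> data waiting to be written

    def write(self, blk, data):
        self.dirty[blk] = data           # lands in memory first

    def sync(self):
        for blk, data in self.dirty.items():
            self.on_disk[blk] = data
        self.dirty.clear()

    def read(self, blk):
        # The normal read path was never wrong: dirty data wins.
        if blk in self.dirty:
            return self.dirty[blk]
        data = self.on_disk[blk]
        return data if data is not None else b"\0"

    def is_hole_buggy(self, blk):
        # Only consults on-disk state, so a freshly written block
        # still looks like a hole until sync() runs.
        return self.on_disk[blk] is None

    def is_hole_fixed(self, blk):
        # Treats any dirty block as data, whatever the disk says.
        return blk not in self.dirty and self.on_disk[blk] is None
```

Between `write()` and `sync()`, `read()` returns the new data while `is_hole_buggy()` still says "hole", so a hole-aware copier would skip a block that actually has content. After `sync()` both answers agree, which is the convergence described above.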