by ajross 911 days ago
No, I get it. I'm saying that's a bad design. The data structure for a VM system is a big tree of discontiguous mappings, which matches the API used for accessing it. If you make a random access to memory at an arbitrary spot, you expect to get a VM trap. If you want to map memory, you're expected to know the layout and manage the "holes" yourself (or else to let the OS manage your memory space for you).

The data structure for a file is an ordered stream of bytes, which matches the API for accessing it. You can jump around by seeking, but there are no holes. Bytes start at 0 and go on from there. Want to seek() to an arbitrary value? Totally legal, presumptively valid.
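
To make the "totally legal, presumptively valid" point concrete, here is a minimal sketch, assuming a POSIX system and a scratch file from `tempfile` (the helper name `read_gap_after_seek` is made up for illustration). It seeks past end-of-file, writes one byte, and reads back bytes that were never written:

```python
import os
import tempfile

def read_gap_after_seek(gap: int = 1 << 20) -> bytes:
    """Write one byte at offset `gap` in a fresh file, then read its first 16 bytes."""
    fd, path = tempfile.mkstemp()
    try:
        os.lseek(fd, gap, os.SEEK_SET)   # jumping past end-of-file is legal
        os.write(fd, b"x")               # file size is now gap + 1
        os.lseek(fd, 0, os.SEEK_SET)
        head = os.read(fd, 16)           # never-written bytes read back as zeros
        assert os.fstat(fd).st_size == gap + 1
        return head
    finally:
        os.close(fd)
        os.unlink(path)
```

From the caller's side the file is still just bytes 0 through `gap`. Whether the filesystem stored the gap as a hole is invisible through this API.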

Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

2 comments

> Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too.

Aren’t all modern file systems implemented as a tree of discontiguous regions? That’s the whole reason block allocators exist, and why file fragmentation (and defragmentation) is a thing.

How could you reasonably expect to implement a filesystem that, under the hood, only operates on contiguous blocks of disk space? It would require the filesystem to have prior knowledge of the size of every file that is going to be written, so it can pre-allocate the contiguous sections. Otherwise, the moment a write made a file exceed the length of its contiguous free region, further writes would have to pause until the filesystem had finished copying the entire file to a new region with more space.

With ZFS, its heavy dependence on tree structures of discontiguous address regions is what enables all of its desirable features. To say the complexity is needless is to implicitly say ZFS itself is pointless.

The issue is that pretty much all other filesystems, at least on Linux, are effectively implemented as swap filesystem drivers with some hierarchical structure on top, because that's the interface pushed by Linux at the kernel level.

In userland, we tend to think of streams of bytes, as provided by original Unix and as all the docs teach us to treat them: read() and write() are the primitives, and they do byte-aligned reads and writes.

Except the actual Linux VFS has, as its core primitive, the mmap() + pagein/pageout mechanism, with read() and write() simulated over the page cache, which treats files as mmap()ed memory regions. That's how IO caching is done on Linux, and it's a source of various issues for ZFS and for people using different architectures, because for a long time (changed quite recently, afaik) the Linux VFS only supported filesystem blocks of page size or smaller. That's a bit of a problem if you're a filesystem like ZFS, where the file block can range from 512 bytes to 4MB (or more) in the same dataset, or VMFS, which uses 1MB blocks.

What has any of that got to do with the bug described in the article? Presumably every filesystem is responsible for tracking the content of sparse files and where the holes are. That's not something the Linux kernel is going to give you for free; the FS needs to tell the kernel which pages should be mapped to block addresses on disk and which pages should be simulated as contiguous blocks of zeros with no on-disk representation.
It's related to the talk about filesystem interface metaphors in this specific subthread :)
That's true of a storage backend, but not the metaphor presented. Again, the analogy would be a heap: heaps are discontiguous internally too, but you don't demand that users of malloc() understand that there can be a hole in the middle of their memory! Again, the bug here was (seems to have been, it's subtle) a glitch in the tracking of holes in files that didn't ever need to have been there in the first place.
But ZFS doesn't demand that users be aware of holes in files. You can just call `seek()` and `read()` anywhere, and ZFS will transparently provide zeros to fill the holes. Linux also allows software to become "hole-aware" using `lseek()` (with `SEEK_HOLE`/`SEEK_DATA`), but that's an optimisation that software can opt into or equally just ignore.
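
For anyone who hasn't seen the opt-in interface, here is a rough sketch of a hole-aware query, assuming Linux and a Python build that exposes `os.SEEK_DATA` and `os.SEEK_HOLE`. The helper names `make_sparse` and `first_data_and_hole` are made up for this example:

```python
import os
import tempfile

def make_sparse(gap: int = 1 << 20) -> str:
    """Create a temp file with `gap` unwritten bytes followed by a short payload."""
    fd, path = tempfile.mkstemp()
    try:
        os.lseek(fd, gap, os.SEEK_SET)
        os.write(fd, b"payload")
    finally:
        os.close(fd)
    return path

def first_data_and_hole(path: str) -> tuple:
    """Return (offset of the first data, offset of the first hole at or after it)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.lseek(fd, 0, os.SEEK_DATA)     # skip over any leading hole
        hole = os.lseek(fd, data, os.SEEK_HOLE)  # end-of-file counts as a hole
        return data, hole
    finally:
        os.close(fd)
```

On a filesystem that doesn't track holes, the whole file is reported as data (first data at 0, first hole at end-of-file), so hole-aware software still gets a correct answer, just a less useful one. Software that never asks sees only zeros.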

The glitch in this case was a failure to correctly track dirty pages that have yet to be written to disk, and thus reading the on-disk data rather than the in-memory data within the dirty page. It just so happens this issue only appears in the code that's responsible for responding to queries about holes from software that's explicitly asking to know about the holes. ZFS itself never had any issues keeping track of the holes; the bookkeeping always converged on the correct state. It's just that during that convergence it was momentarily possible to be given old metadata about holes (i.e. what's currently on disk), rather than the current metadata about holes (i.e. what's currently only in memory, and about to be written to disk).
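
As a toy model of that description (this is not ZFS code, and every name in it is made up for illustration), picture a block-granular file with an on-disk map and a set of dirty blocks. The hole query goes wrong if it only consults the on-disk map:

```python
class ToyFile:
    """Toy model: per-block on-disk state plus dirty (unflushed) in-memory state."""

    def __init__(self, nblocks: int):
        self.on_disk = [None] * nblocks  # None means "hole, nothing stored"
        self.dirty = {}                  # block index -> data not yet flushed

    def write(self, i: int, data: bytes) -> None:
        self.dirty[i] = data             # lands in memory first

    def flush(self) -> None:
        for i, data in self.dirty.items():
            self.on_disk[i] = data
        self.dirty.clear()

    def read(self, i: int) -> bytes:
        # Ordinary reads are correct: dirty data wins, holes read as zeros.
        if i in self.dirty:
            return self.dirty[i]
        block = self.on_disk[i]
        return b"\0" if block is None else block

    def is_hole_stale(self, i: int) -> bool:
        # Stale query: answers from on-disk state only.
        return self.on_disk[i] is None

    def is_hole(self, i: int) -> bool:
        # Current query: a dirty block counts as data before it is flushed.
        return i not in self.dirty and self.on_disk[i] is None
```

Between `write()` and `flush()`, `read()` returns the new data while `is_hole_stale()` still reports a hole. After the flush the two queries agree again, which matches the "converged on the correct state" part.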

There are pretty good reasons for treating files as sparse; virtualization and deduplication. Virtualization of storage devices without sparse files would be slowed tremendously by the need to allocate and zero large regions before use, essentially double-writing during the installation and initial provisioning stage. You can force the virtualization layer to implement sparse storage but then you get a host of incompatible disk image formats (vmdk, qcow2, etc.) and N times as many opportunities for bugs like the article describes to be introduced.
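
A quick way to see the provisioning difference, sketched under the assumption of a POSIX system where `st_blocks` counts 512-byte units, as it does on Linux (the helper name `provision_sparse` is made up here):

```python
import os
import tempfile

def provision_sparse(size: int) -> tuple:
    """Create a `size`-byte file without writing data; return (apparent, allocated) bytes."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)           # extends the file; no data is written
        st = os.fstat(fd)
        # Assumption: st_blocks is in 512-byte units.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)
```

On a filesystem with sparse-file support the allocated figure stays near zero however large the image is, so a disk image can be "created" instantly and blocks are only allocated as the guest writes them.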

Deduplication is basically a superset of sparse files where the zero block is a single instance of duplication. Deduplication isn't right for every task, but for basically any public shared storage some form of deduplication is vital to avoid wasting space on duplicate copies.
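
To illustrate the "superset" claim, here is a minimal content-addressed block store (a sketch only: `DedupStore` and its methods are hypothetical and not modelled on any real filesystem's dedup tables). A run of zero blocks collapses to one stored block, which is the effect a hole gives you:

```python
import hashlib

class DedupStore:
    """Toy block store: each distinct block is kept once, keyed by its SHA-256."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}                 # digest -> block bytes

    def put(self, data: bytes) -> list:
        """Store `data` block by block; return the list of block digests."""
        refs = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # duplicates are not stored again
            refs.append(digest)
        return refs

    def get(self, refs: list) -> bytes:
        """Reassemble the original byte stream from its block digests."""
        return b"".join(self.blocks[d] for d in refs)
```

A file made of 200 zero blocks plus one block of real data stores only two blocks, and `get()` still hands back the full byte stream, so the reader's view is unchanged.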

Sparse/deduplicated files still maintain the read/write semantics of files as streams of bytes; they allow additional operations not part of the original Unix model. Exposing them to userspace probably isn't a mistake per se because it is essentially no different than ioctls or socket-related functions that are a vital part of Unix at this point.

Those aren't general principles. They're just tricks. Some software uses them. No significant software paradigms are critically dependent on sparse files. Quite frankly almost no significant market-driving software uses them at all. Not sure what you have in mind, but a few examples might be helpful?

All of them have a straightforward expression using contiguous storage. At best, sparse files allow you to reduce application-layer complexity. But as I'm pointing out, that comes at the cost of filesystem-layer complexity up and down the stack and throughout the kernel, and that's a bad trade.