|
|
|
|
|
by avianlyric
911 days ago
|
|
> Making the filesystem, implemented from first principles to handle the second style of interaction, actually be implemented in terms of the first under the hood, is a source of needless complexity and bugs. And it was here, too. Aren’t all modern file systems implemented as a tree of discontinuous regions? That’s the whole reason block allocators exist, why file fragmentation is a thing (and defragmentation processes). How could you reasonably expect to implement a filesystem that under hood only operates with continuous blocks disk space? It would require the filesystem to have prior knowledge of the size of all the files that going to be written, so it can pre-allocate the continuous sections. Or the second writing a file resulted in that file exceeding the length of the continuous empty section of disk, future writes would have to pause until the filesystem had finished copying the entire file to a new region with more space. With ZFS its heavy dependence on tree structures of discontinuous address regions is what enables all of its desirable feature. To say the complexity is needless is to implicitly say ZFS itself is pointless. |
|
In userland, we tend to think of streams of bytes, as provided by original Unix and as all the docs teach us to treat them - that read(), write() are the primitives and they do byte-aligned reads and writes.
Except the actual Linux VFS has, as its core primitive, mmap() + pagein/pageout mechanism, with read() and write() being simulated over the pagecache which treats the files as mmap()ed memory regions. It's how IO caching is done on Linux, and it's source of various issues for ZFS and people using different architectures because for a long time (changed quite recently, afaik) Linux VFS only supported page-sized or smaller filesystem blocks. Which is a bit of a problem if you're a filesystem like ZFS where the file block can go from 512b to 4MB (or more) in the same dataset, or VMFS which uses 1MB blocks.