Hacker News new | ask | show | jobs
by benlivengood 911 days ago
There are pretty good reasons for treating files as sparse; virtualization and deduplication. Virtualization of storage devices without sparse files would be slowed tremendously by the need to allocate and zero large regions before use, essentially double-writing during the installation and initial provisioning stage. You can force the virtualization layer to implement sparse storage but then you get a host of incompatible disk image formats (vmdk, qcow2, etc.) and N times as many opportunities for bugs like the article describes to be introduced.

Deduplication is basically a superset of sparse files where the zero block is a single instance of duplication. Deduplication isn't right for every task but for basically any public shared storage some form of deduplication is vital to avoid wasting duplicate copies.

Sparse/deduplicated files still maintain the read/write semantics of files as streams of bytes; they allow additional operations not part of the original Unix model. Exposing them to userspace probably isn't a mistake per se because it is essentially no different than ioctls or socket-related functions that are a vital part of Unix at this point.

1 comments

Those aren't general principles. They're just tricks. Some software uses them. No significant software paradigms are critically dependent on sparse files. Quite frankly almost no significant market-driving software uses them at all. Not sure what you have in mind, but a few examples might be helpful?

All of them have a straightforward expression using contiguous storage. At best, sparse files allow you to reduce application-layer complexity. But as I'm pointing out, that comes at the cost of filesystem-layer complexity up and down the stack and throughout the kernel, and that's a bad trade.