| HN Mirror


by benlivengood 911 days ago

There are pretty good reasons for treating files as sparse: virtualization and deduplication. Without sparse files, virtualized storage devices would be slowed tremendously by the need to allocate and zero large regions before use, essentially double-writing during installation and initial provisioning. You can force the virtualization layer to implement sparse storage itself, but then you get a host of incompatible disk image formats (vmdk, qcow2, etc.) and N times as many opportunities to introduce bugs like the one the article describes.
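To make the provisioning cost concrete, here is a minimal sketch (assuming a Unix-like system and a filesystem that supports holes) comparing a file's logical size with the space actually allocated for it. The helper names `make_sparse` and `sizes` are made up for illustration; only `truncate` and `os.stat` are real calls.

```python
import os


def make_sparse(path, logical_size):
    """Create a file that is logical_size bytes long without writing any data.

    On filesystems that support holes, no data blocks are allocated here.
    """
    with open(path, "wb") as f:
        f.truncate(logical_size)


def sizes(path):
    """Return (logical size, allocated bytes) for path.

    st_blocks is counted in 512-byte units on Unix-like systems.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512
```

On a filesystem with hole support, the allocated figure for a freshly created image should stay near zero no matter how large the logical size is. The exact number depends on the filesystem, so treat it as indicative rather than guaranteed.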

Deduplication is basically a superset of sparse files, where the zero block is just one instance of a duplicated block. Deduplication isn't right for every task, but for basically any public shared storage some form of it is vital to avoid wasting space on duplicate copies.
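A toy illustration of the "zero block is just another duplicate" point, as a sketch only: a content-addressed block store where each file is a list of block hashes. `BlockStore` and the 4096-byte `BLOCK_SIZE` are assumptions for the example, not how any particular filesystem does it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Hypothetical content-addressed store: identical blocks are kept once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes

    def put(self, data):
        digest = hashlib.sha256(data).digest()
        self.blocks.setdefault(digest, data)  # a duplicate block costs nothing extra
        return digest

    def write_file(self, data):
        """Split data into blocks and return the file's 'recipe' of digests."""
        return [
            self.put(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)
        ]

    def read_file(self, recipe):
        """Reassemble the original byte stream from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)
```

A file made of 100 zero blocks followed by one block of data ends up with a 101-entry recipe but only two stored blocks: the zero block and the data block. A sparse file's hole is that same case, with the zero block special-cased by the filesystem.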

Sparse/deduplicated files still maintain the read/write semantics of files as streams of bytes; they just allow additional operations that were not part of the original Unix model. Exposing those operations to userspace probably isn't a mistake per se, because it is essentially no different from ioctls or socket-related functions, which are a vital part of Unix at this point.
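For a concrete example of such an "additional operation", here is a rough sketch using `lseek` with `SEEK_DATA` and `SEEK_HOLE`, assuming Linux and a Python build that exposes `os.SEEK_DATA`. The function name `data_extents` is made up. Ordinary `read()` calls are unaffected and just see zeros in the holes.

```python
import errno
import os


def data_extents(path):
    """Return a list of (start, end) byte ranges that the filesystem reports as data.

    Filesystems without hole support typically report the whole file
    as a single data extent.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.fstat(fd).st_size
        offset = 0
        while offset < end:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:  # no data at or after offset
                    break
                raise
            # Every file has an implicit hole at EOF, so this always succeeds.
            stop = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, stop))
            offset = stop
    finally:
        os.close(fd)
    return extents
```

A tool that doesn't know about holes can ignore all of this and still read the file correctly, just more slowly. Tools that do care, such as copy or backup utilities, can use the extents to skip the empty regions.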

1 comment

ajross 911 days ago

Those aren't general principles. They're just tricks. Some software uses them. No significant software paradigms are critically dependent on sparse files. Quite frankly almost no significant market-driving software uses them at all. Not sure what you have in mind, but a few examples might be helpful?

All of those uses have a straightforward expression using contiguous storage. At best, sparse files allow you to reduce application-layer complexity. But as I'm pointing out, that comes at the cost of filesystem-layer complexity up and down the stack and throughout the kernel, and that's a bad trade.
