by adrianmonk 2286 days ago
It really depends on the type of file. ("Other large blob types" is a rather broad category.)

One obvious example where you could have a lot of common blocks (even following the offset where a change was made) is zip files. The zip format basically compresses each file individually and then concatenates all that together.

Let's say you have a build and it packages the results up as a big zip file. (Java builds often do this. A jar is a special type of zip file.) If you change a few source files and rebuild, and if your build is deterministic (and/or incremental), then the new zip file will contain a lot of the same stuff as the previous version. And if your zip archiver is deterministic (pretty safe assumption), it should produce a zip file that is mostly the same sequences of bytes as the previous zip file, even if there are changed files in the middle.
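A quick way to check this claim is a rough Python sketch like the one below. The helper names `build_zip` and `compressed_payloads` are made up for illustration, the fixed timestamp stands in for a "deterministic archiver", and the code assumes small archives without zip64 or data descriptors. It rebuilds an archive with one changed member and compares the raw compressed bytes stored for each member.

```python
import io
import struct
import zipfile

def build_zip(members):
    """Write members (name -> bytes) to an in-memory zip with fixed timestamps."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            # A fixed date_time keeps the output independent of the clock.
            info = zipfile.ZipInfo(name, date_time=(2020, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, data)
    return buf.getvalue()

def compressed_payloads(raw):
    """Map member name -> the raw compressed bytes stored for that member."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():
            # Local file header: 30 fixed bytes, then the name and extra field.
            name_len, extra_len = struct.unpack_from(
                "<HH", raw, info.header_offset + 26)
            start = info.header_offset + 30 + name_len + extra_len
            out[info.filename] = raw[start:start + info.compress_size]
    return out

old = build_zip({
    "a.txt": b"alpha " * 500,
    "b.txt": b"bravo " * 500,
    "c.txt": b"charlie " * 500,
})
new = build_zip({
    "a.txt": b"alpha " * 500,
    "b.txt": b"bravo " * 499 + b"BRAVO!",   # the one changed "source file"
    "c.txt": b"charlie " * 500,
})

old_p, new_p = compressed_payloads(old), compressed_payloads(new)
for name in old_p:
    print(name, "unchanged" if old_p[name] == new_p[name] else "changed")
```

Because each member is compressed on its own, the member after the changed one keeps the same compressed bytes; only its offset in the archive can move.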

If you write a .tar.gz archive, then one change in the middle will throw everything off from that point on, because gzip compresses the whole archive as one stream instead of compressing individual files. In theory a binary diff can work around this by first undoing the gzip that was applied to each large blob, then doing a binary diff on the uncompressed data, and then arranging to be able to recreate what gzip did. Obviously that's messy.
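Here is a small Python sketch of the "undo the gzip first" idea, under some assumptions: the helpers `build_tar`, `shared_edges` and `differing_span` are hypothetical names, the archive is tiny, and the "binary diff" is reduced to measuring the common prefix and suffix. It flips one bit in the middle member and then compares how much the two versions share, once as plain tar and once as .tar.gz.

```python
import gzip
import io
import tarfile

def build_tar(members):
    """Build an uncompressed tar in memory; TarInfo defaults keep metadata fixed."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name)   # mtime, uid and gid default to 0
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def shared_edges(a, b):
    """Return (common prefix length, common suffix length) of two byte strings."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, suffix

def differing_span(a, b):
    """Size of the middle region not covered by the shared prefix and suffix."""
    prefix, suffix = shared_edges(a, b)
    return max(len(a), len(b)) - prefix - suffix

def text(tag, n):
    """Generate n lines of compressible filler text."""
    return b"".join(b"%s line %d\n" % (tag, i) for i in range(n))

middle = bytearray(text(b"b", 1000))
middle[len(middle) // 2] ^= 0x01           # flip one bit in the middle file

old_tar = build_tar({"a": text(b"a", 1000), "b": text(b"b", 1000),
                     "c": text(b"c", 1000)})
new_tar = build_tar({"a": text(b"a", 1000), "b": bytes(middle),
                     "c": text(b"c", 1000)})

# mtime=0 keeps the gzip header identical between the two runs.
old_gz = gzip.compress(old_tar, mtime=0)
new_gz = gzip.compress(new_tar, mtime=0)

print("tar    differing span:", differing_span(old_tar, new_tar), "of", len(new_tar))
print("tar.gz differing span:", differing_span(old_gz, new_gz), "of", len(new_gz))
```

The uncompressed comparison is the part a diff tool can exploit. The hard part mentioned above, reproducing gzip's exact output afterwards, is not attempted here.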

Of course, not every file is an archive. Some are filesystems. But any writable filesystem (notably not including ISOs) that is capable of being used on a hard disk will of necessity not rewrite everything. If it did, changing one file on a filesystem would take hours because the rest of the partition would have to be rewritten.

Another obvious type of big blob is multimedia. I don't know a lot of specifics, but I would think file formats meant for editors would keep changes localized for reducing IO (for example, so that changes in a non-linear video editor don't need to write a giant file), but formats meant for export and delivery might change the whole file since they're aiming for small size.

1 comments

So ZIPs don't have any "global" directory thing? :o
They don't have a global compression dictionary thing.

A similar effect can be achieved with gzip --rsyncable, which IIRC resets the dictionary based on a rolling sum.
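For anyone curious what "reset based on a rolling sum" buys you, here is a toy Python sketch of the idea. It is not gzip's implementation: WINDOW and MODULUS are arbitrary values picked for illustration, the function names are made up, and compressing each chunk as a separate zlib stream is a stand-in for resetting the dictionary inside one gzip stream.

```python
import zlib

WINDOW = 32    # bytes in the rolling window (arbitrary illustrative value)
MODULUS = 64   # cut when the window sum is a multiple of this (also arbitrary)

def cut_points(data, window=WINDOW, modulus=MODULUS):
    """Return offsets where a chunk ends, decided from nearby content only."""
    cuts = []
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]    # drop the byte that left the window
        # Only cut once the window is full, so the decision depends on
        # exactly the last `window` bytes and nothing earlier.
        if i + 1 >= window and rolling % modulus == 0:
            cuts.append(i + 1)
    return cuts

def split(data, cuts):
    """Slice data at the given cut offsets, keeping any trailing remainder."""
    chunks, start = [], 0
    for cut in cuts:
        chunks.append(data[start:cut])
        start = cut
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def compress_chunks(data):
    """Compress each chunk as its own zlib stream, so no state crosses a cut."""
    return [zlib.compress(chunk) for chunk in split(data, cut_points(data))]

# Insert a few bytes near the start and see how many compressed chunks survive.
data = bytes((i * i * 31 + i // 7) % 251 for i in range(20000))
edited = data[:100] + b"INSERTED" + data[100:]
before, after = compress_chunks(data), compress_chunks(edited)
print("chunks before:", len(before), "after:", len(after),
      "identical:", len(set(before) & set(after)))
```

Since a cut depends only on the last WINDOW bytes, the cut positions after the edit line up again (shifted by the inserted length) as soon as the window has moved past the inserted bytes, and chunks between matching cuts compress to the same output.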

They have a non-essential copy of the directory at the end for speed; if it is corrupted, tools exist to rebuild it from the entries inside the file. But it is usually very small (the only real-life exception I have met is the HVSC archive, where the directory size is very significant, so they zip it again).
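As a rough illustration of how such a rebuild can work, the Python sketch below walks the per-entry local headers from the start of the file and compares what it finds with the directory at the end. `scan_local_headers` and `build_demo_zip` are hypothetical names, and the sketch assumes the simple case: no zip64, no encryption, entries packed back to back from offset 0, and sizes present in the local headers (it stops if an entry uses a trailing data descriptor instead). Note that ordinary readers, Python's `zipfile` included, do rely on the end-of-file directory.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"   # signature that starts every local file header

def scan_local_headers(raw):
    """Walk local headers from offset 0; return (name, offset, compressed size)."""
    entries = []
    pos = 0
    while raw[pos:pos + 4] == LOCAL_SIG:
        (flags, _method, _time, _date, _crc, csize, _usize,
         name_len, extra_len) = struct.unpack_from("<HHHHIIIHH", raw, pos + 6)
        if flags & 0x08:
            # Sizes live in a data descriptor after the data; not handled here.
            break
        encoding = "utf-8" if flags & 0x800 else "cp437"
        name = raw[pos + 30:pos + 30 + name_len].decode(encoding)
        entries.append((name, pos, csize))
        # Skip the header, name, extra field and compressed data.
        pos += 30 + name_len + extra_len + csize
    return entries

def build_demo_zip():
    """Small in-memory archive used only to exercise the scanner."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", b"alpha " * 500)
        zf.writestr("dir/b.txt", b"bravo " * 500)
    return buf.getvalue()

raw = build_demo_zip()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    from_directory = [(i.filename, i.header_offset, i.compress_size)
                      for i in zf.infolist()]

print("from local headers:", scan_local_headers(raw))
print("from end directory:", from_directory)
```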