
by st_goliath 178 days ago

> ... a partial download would be totally useless ...

No, not totally. The directory at the end of the archive points backwards to the local headers, which in turn include all the necessary information, e.g. the compressed size inside the archive, the compression method, the filename, and even a checksum.

If the archive isn't some recursive/polyglot nonsense as in the article, it's essentially just a tightly packed list of compressed blobs, each with a neat local header in front (that even includes a magic number!). The directory at the end is really just for quick access.

If your extraction program supports it (or you are sufficiently motivated to cobble together a small C program with zlib...), you can salvage what you have by linearly scanning and extracting the archive, somewhat like a fancy tarball.
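To make the linear scan concrete, here is a minimal sketch in Python rather than C, using only `struct` and `zlib` from the standard library. The function name `salvage` is made up for illustration. It assumes stored or deflated entries only, no ZIP64, no encryption, and it skips entries written in streaming mode (general-purpose flag bit 3), because those keep their sizes and CRC in a trailing data descriptor instead of the local header.

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"  # magic number in front of every local header
# Fixed 30-byte local header (little-endian): signature, version needed,
# flags, method, mod time, mod date, CRC-32, compressed size,
# uncompressed size, name length, extra field length.
LOCAL_FMT = "<4sHHHHHIIIHH"
LOCAL_LEN = struct.calcsize(LOCAL_FMT)


def salvage(data: bytes) -> dict[str, bytes]:
    """Scan forward for local headers and return every member that is complete."""
    recovered = {}
    pos = 0
    while True:
        pos = data.find(LOCAL_SIG, pos)
        if pos < 0 or pos + LOCAL_LEN > len(data):
            break
        (_, _, flags, method, _, _, crc, csize, _usize,
         nlen, xlen) = struct.unpack_from(LOCAL_FMT, data, pos)
        name_start = pos + LOCAL_LEN
        body_start = name_start + nlen + xlen
        body_end = body_start + csize
        # Skip streamed entries (bit 3), truncated bodies, and unknown methods.
        if flags & 0x08 or body_end > len(data) or method not in (0, 8):
            pos += 4
            continue
        raw = data[body_start:body_end]
        try:
            # Method 0 is "stored"; method 8 is raw deflate (wbits=-15).
            payload = raw if method == 0 else zlib.decompress(raw, -15)
        except zlib.error:
            pos += 4
            continue
        if zlib.crc32(payload) == crc:
            name = data[name_start:name_start + nlen].decode("utf-8", "replace")
            recovered[name] = payload
            pos = body_end  # jump past the blob we just extracted
        else:
            pos += 4  # false positive on the magic number; keep scanning
    return recovered
```

Feeding it the bytes of a partial download returns every member whose compressed blob arrived in full; the member that was cut off, and anything after it, is simply missing from the result.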

3 comments

nwallin 178 days ago

At work, our daily build (actually 4x per day) is a handful of zip files totaling some 7GB. The script to get the build would copy the archives over the network, then decompress them into your install directory.

This worked great on campus, but once everyone went remote during COVID it no longer did. It went from three minutes to like twenty minutes.

However, most files change only rarely. I don't need all the files, just the ones which are different. So I wrote a scanner thing which compares each zip entry's file size and checksum to the size and checksum of the local file. If they're the same, we skip it; otherwise, we decompress it out of the zip file. This cut the time to get the daily build from 20 minutes to 4 minutes.

Obviously this isn't resilient to an attacker, since crc32 is not secure, but as an internal tool it's awesome.
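A rough sketch of what such a scanner could look like, in Python with the standard `zipfile` module. This is not the actual tool; `sync_from_zip` and `file_crc32` are hypothetical names, and the sketch assumes member names map directly onto paths under the install directory.

```python
import os
import zipfile
import zlib


def file_crc32(path: str, chunk_size: int = 1 << 20) -> int:
    """CRC-32 of a file on disk, computed in chunks to keep memory flat."""
    crc = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def sync_from_zip(zip_path: str, dest_dir: str) -> list[str]:
    """Extract only the members whose size or CRC-32 differs from the local copy."""
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = os.path.join(dest_dir, info.filename)
            # Cheap checks first: existence, then size, then the checksum.
            unchanged = (
                os.path.isfile(target)
                and os.path.getsize(target) == info.file_size
                and file_crc32(target) == info.CRC
            )
            if not unchanged:
                zf.extract(info, dest_dir)
                extracted.append(info.filename)
    return extracted
```

The size and CRC-32 of each member come straight from the zip's directory, so nothing has to be decompressed to decide whether a file can be skipped.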


btilly 178 days ago

How would this have compared to using rsync?


necovek 178 days ago

Not as much geek cred for using an off-the-shelf solution? ;)


tonyedgecombe 178 days ago

XPS (Microsoft's alternative to PDF) supported this. XPS files were ZIP files under the hood and were handled directly by some printers. The problem was the printer never had enough memory to hold a large file so you had to structure the document in a way it could be read a page at a time from the start.


brabel 178 days ago

> the directory at the end is really just for quick access.

No, its purpose was to allow archives spanning multiple floppy disks. You would insert the last disk, then the others, one by one…


st_goliath 178 days ago

That literally is quick access. It does the same thing in both cases: it gets rid of the linear scan and of having to plow through data unnecessarily.

If the archive is on a hard disk (or it is a small archive on a single floppy), the program reads the directory at the end and then has the disk arm or the floppy motor seek to the local header, rather than doing a linear scan.

If you have multiple floppies, you insert the last one, the program reads the directory and then tells you which floppy to insert, rather than having to go through them one by one, which, you know, would be slower.

In one case, a hard disk arm or the floppy motor does the seeking; in the other, your hands do the seeking. But it's still the same algorithm, doing the same thing, for the same reason.
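For illustration, a minimal Python sketch of that lookup: find the end-of-central-directory record, walk the directory entries, and collect for each member the disk it starts on and the offset of its local header. `read_directory` is a made-up name, and the sketch assumes the whole directory is in one buffer, no ZIP64, and an archive comment that does not itself contain the end-record signature.

```python
import struct

EOCD_SIG = b"PK\x05\x06"
CDIR_SIG = b"PK\x01\x02"
# End-of-central-directory record (22 bytes): signature, this disk,
# disk where the directory starts, entries on this disk, total entries,
# directory size, directory offset, comment length.
EOCD_FMT = "<4sHHHHIIH"
# Central directory entry (46 fixed bytes); field 13 is the starting disk
# number and field 16 the offset of the member's local header.
CDIR_FMT = "<4sHHHHHHIIIHHHHHII"
CDIR_LEN = struct.calcsize(CDIR_FMT)


def read_directory(data: bytes) -> dict[str, tuple[int, int]]:
    """Map each member name to (starting disk number, local header offset)."""
    eocd_pos = data.rfind(EOCD_SIG)
    if eocd_pos < 0:
        raise ValueError("no end-of-central-directory record found")
    (_, _, _, _, total, _cd_size, cd_offset, _) = struct.unpack_from(
        EOCD_FMT, data, eocd_pos)
    entries = {}
    pos = cd_offset
    for _ in range(total):
        fields = struct.unpack_from(CDIR_FMT, data, pos)
        if fields[0] != CDIR_SIG:
            raise ValueError("corrupt central directory entry")
        nlen, xlen, clen, disk = fields[10:14]
        name = data[pos + CDIR_LEN:pos + CDIR_LEN + nlen].decode("utf-8", "replace")
        entries[name] = (disk, fields[16])
        # Variable-length name, extra field and comment follow the fixed part.
        pos += CDIR_LEN + nlen + xlen + clen
    return entries
```

With the result in hand, getting at one member is a single seek to its offset on a hard disk, or a prompt for the right disk number on a floppy set; neither needs to read the blobs in between.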
