| HN Mirror

by duskwuff 252 days ago

ZIP isn't a terrible format, but it has a couple of flaws and limitations which make it a less than ideal format for long-term archiving. The biggest ones I'd call out are:

1) The format has limited and archaic support for file metadata - e.g. file modification times are stored as an MS-DOS timestamp with a 2-second (!) resolution, and there's no standard system for representing other metadata.

2) The single-level central directory can be awkward to work with for archives containing a very large number of members.

3) Support for 64-bit file sizes exists but is a messy hack.

4) Compression operates on each file as a separate stream, reducing its effectiveness for archives containing many small files. The format does support pluggable compression methods, but there's no straightforward way to support "solid" compression.

5) There is technically no way to reliably identify a ZIP file, as the end of central directory record can appear at any location near the end of the file, and the file can contain arbitrary data at its start. Most tools recognize ZIP files by the presence of a local file header at the start ("PK\x03\x04"), but that's not reliable.
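To make point 1 concrete, here is a rough sketch of how the classic 16-bit MS-DOS date and time fields pack a timestamp. The function names are made up for illustration; the bit layout is the traditional DOS one (years counted from 1980, seconds stored divided by two), which is where the 2-second resolution comes from.

```python
from datetime import datetime

def to_dos_datetime(dt):
    """Pack a datetime into (dos_date, dos_time), each a 16-bit integer."""
    # Date: 7 bits year-since-1980, 4 bits month, 5 bits day.
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    # Time: 5 bits hour, 6 bits minute, 5 bits for seconds // 2.
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return dos_date, dos_time

def from_dos_datetime(dos_date, dos_time):
    """Unpack the two 16-bit fields back into a datetime."""
    return datetime(
        1980 + (dos_date >> 9),
        (dos_date >> 5) & 0x0F,
        dos_date & 0x1F,
        dos_time >> 11,
        (dos_time >> 5) & 0x3F,
        (dos_time & 0x1F) * 2,  # the odd second was lost when packing
    )

original = datetime(2025, 3, 14, 15, 9, 27)
round_tripped = from_dos_datetime(*to_dos_datetime(original))
print(original, "->", round_tripped)
```

Any odd second is rounded down on the way in, and there is no room for a time zone or sub-second precision without extra fields.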

2 comments

Lammy 252 days ago

> there's no straightforward way to support "solid" compression.

I do it by ignoring ZIP's native compression entirely, using store-only ZIP files and then compressing the whole thing at the filesystem level instead.

Here's an example comparison of the same WWW site rip in a DEFLATE ZIP, in a store-only ZIP with zstd filesystem compression, in a tar with the same zstd filesystem compression (nearly identical size but less useful for seeking due to lack of trailing directory versus ZIP), and finally the raw size pre-zipping:

  982M preserve.mactech.com.deflate.zip
  408M preserve.mactech.com.store.zip
  410M preserve.mactech.com.tar
  3.8G preserve.mactech.com


  [Lammy@popola] zfs get compression spinthedisc/Backups/WWW
  NAME                     PROPERTY     VALUE           SOURCE
  spinthedisc/Backups/WWW  compression  zstd            local

This probably wouldn't help GP with their need for HTTP seeking since their HTTP server would incur a decompress+recompress at the filesystem boundary.
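For anyone who wants to try the store-only approach described above, a minimal sketch using Python's standard `zipfile` module might look like this. The helper name and the sample pages are invented for the example, and it builds the archive in memory rather than on a compressed filesystem.

```python
import io
import zipfile

def make_store_only_zip(members):
    """Build a ZIP whose members are stored uncompressed (method 0)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()

# Hypothetical pages sharing the same header markup.
pages = {
    "page%d.html" % i: b"<html><header>site nav</header>" + str(i).encode() + b"</html>"
    for i in range(3)
}
archive = make_store_only_zip(pages)

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    for info in zf.infolist():
        # Stored members have identical compressed and uncompressed sizes.
        print(info.filename, info.compress_size, info.file_size)
```

Because the members are stored verbatim, whatever compresses the container afterwards sees the raw, repetitive bytes, while the central directory still allows seeking to individual members.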

nicman23 252 days ago

lool why use zip then anyways? put them in a folder

Lammy 251 days ago

It's for when you have a very large number of mostly-identical files, like web pages with consistent header and footer. If 408MiB versus 3.8GiB is a meaningless difference to you then sure don't bother with compression, but why I want it should be very obvious to most people here.

nicman23 250 days ago

you completely missed what i asked you but ok

Lammy 243 days ago

I don't think I did, but please explain :)

The last example in my list of four file sizes is them in a folder. Filesystem compression works at the file level, so you have to turn many-almost-identical-files into one file in order to benefit from it. ZFS does have block-level deduplication, but that's its own can of worms that shouldn't be turned on flippantly due to the resource requirements and `recordsize` tuning needed to really benefit from it.
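A toy illustration of why combining the files matters, assuming synthetic pages with a shared header and footer and using `zlib` from the standard library as a stand-in for zstd (so the exact ratios will not match the real numbers above):

```python
import zlib

# Synthetic pages: identical header and footer, tiny unique body.
header = b"<html><head><title>Example</title></head><body><nav>" + b"menu item " * 40 + b"</nav>"
footer = b"<footer>" + b"footer link " * 40 + b"</footer></body></html>"
pages = [header + ("<p>Article number %d</p>" % i).encode() + footer for i in range(200)]

raw = sum(len(p) for p in pages)
# Each file compressed on its own: the shared markup is re-encoded every time.
per_file = sum(len(zlib.compress(p)) for p in pages)
# All files compressed as one stream: later pages can reference earlier ones.
solid = len(zlib.compress(b"".join(pages)))

print("raw:", raw, "per-file:", per_file, "single stream:", solid)
```

The single-stream figure should come out well below the per-file total, because the compressor only has to describe the repeated header and footer once per window instead of once per file.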

nicman23 241 days ago

you do not need dedup just use reflinks for everything. if that workflow does not work then eh i understand why you would use zips

although zfs dedup is probably better in 2025

gildas 251 days ago

FYI, zip.js has no issues with 1 (it can be fixed with standard extra fields), 3 (zip64 support), and 5 (you cannot have more than 64K of comment data at the end of the file).

duskwuff 251 days ago

With regard to the first two - that's good for zip.js, but the problem is that support for those features isn't universal. There's been a lot of fragmentation over the last 36 years (!).

As far as the last (file type detection) goes, the generally agreed upon standard is that file formats should be "sniffable" by looking for a signature in the file's header - ideally within the first few bytes of the file. Having to search through 64 KB of the file's end for a signature is a major departure from that pattern.
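To show what that tail search involves, here is a rough sketch that scans backwards for the end of central directory signature ("PK\x05\x06") within the last 22 + 65535 bytes. The function and constant names are invented for the example, and it ignores zip64 and multi-disk archives.

```python
import io
import struct
import zipfile
from typing import Optional

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22      # fixed part of the record, without the comment
MAX_COMMENT = 0xFFFF    # the comment length is a 16-bit field

def find_eocd(data: bytes) -> Optional[int]:
    """Return the offset of a plausible EOCD record, or None."""
    tail_start = max(0, len(data) - (EOCD_MIN_SIZE + MAX_COMMENT))
    pos = data.rfind(EOCD_SIGNATURE, tail_start)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            # The comment length sits in the last two bytes of the fixed part.
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_MIN_SIZE + comment_len == len(data):
                return pos
        # Keep looking for an earlier candidate within the tail window.
        pos = data.rfind(EOCD_SIGNATURE, tail_start, pos + len(EOCD_SIGNATURE) - 1)
    return None

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello")
plain = buf.getvalue()

# Arbitrary leading data, as in a self-extracting archive.
prefixed = b"#!/bin/sh\nexit 0\n" + plain
print(prefixed[:4] == b"PK\x03\x04", find_eocd(prefixed))
```

The prefixed file no longer starts with a local file header, yet the trailing record is still found, which is why header sniffing alone misses some valid archives.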
