Hacker News new | ask | show | jobs
by gildas 241 days ago
For implementation in a library, you can use HttpRangeReader [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid feature that has been in the library for about 10 years.

[1] https://gildas-lormeau.github.io/zip.js/api/classes/HttpRang...

[2] https://github.com/gildas-lormeau/zip.js/blob/master/tests/a...

[3] https://github.com/gildas-lormeau/zip.js

1 comments

Based on your experience, is zip the optimal archive format for long term digital archival in object storage if the use case calls for reading archives via http for scanning and cherry picking? Or is there a more optimal archive format?
Unfortunately, I will have difficulty answering your question because my knowledge is limited to the zip format. In the use case presented in the article, I find that the zip format meets the need well. Generally speaking, in the context of long-term archiving, its big advantage is also that there are thousands of implementations for reading/writing zip files.
ZIP isn't a terrible format, but it has a couple of flaws and limitations which make it a less than ideal format for long-term archiving. The biggest ones I'd call out are:

1) The format has limited and archaic support for file metadata - e.g. file modification times are stored as a MS-DOS timestamp with a 2-second (!) resolution, and there's no standard system for representing other metadata.

2) The single-level central directory can be awkward to work with for archives containing a very large number of members.

3) Support for 64-bit file sizes exists but is a messy hack.

4) Compression operates on each file as a separate stream, reducing its effectiveness for archives containing many small files. The format does support pluggable compression methods, but there's no straightforward way to support "solid" compression.

5) There is technically no way to reliably identify a ZIP file, as the end of central directory record can appear at any location near the end of the file, and the file can contain arbitrary data at its start. Most tools recognize ZIP files by the presence of a local file header at the start ("PK\x01\x02"), but that's not reliable.

> there's no straightforward way to support "solid" compression.

I do it by ignoring ZIP's native compression entirely, using store-only ZIP files and then compressing the whole thing at the filesystem level instead.

Here's an example comparison of the same WWW site rip in a DEFLATE ZIP, in a store-only ZIP with zstd filesystem compression, in a tar with same zstd filesystem compression (identical size but less useful for seeking due to lack of trailing directory versus ZIP), and finally the raw size pre-zipping:

  982M preserve.mactech.com.deflate.zip
  408M preserve.mactech.com.store.zip
  410M preserve.mactech.com.tar
  3.8G preserve.mactech.com


  [Lammy@popola] zfs get compression spinthedisc/Backups/WWW
  NAME                     PROPERTY     VALUE           SOURCE
  spinthedisc/Backups/WWW  compression  zstd            local

This probably wouldn't help GP with their need for HTTP seeking since their HTTP server would incur a decompress+recompress at the filesystem boundary.
lool why use zip then anyways? put them in a folder
It's for when you have a very large number of mostly-identical files, like web pages with consistent header and footer. If 408MiB versus 3.8GiB is a meaningless difference to you then sure don't bother with compression, but why I want it should be very obvious to most people here.
you completely missed what i asked you but ok
FYI, zip.js has no issues with 1 (it can be fixed with standard extra fields), 3 (zip64 support), and 5 (you cannot have more than 64K of comment data at the end of the file).
With regard for the first two - that's good for zip.js, but the problem is that support for those features isn't universal. There's been a lot of fragmentation over the last 36 years (!).

As far as the last (file type detection) goes, the generally agreed upon standard is that file formats should be "sniffable" by looking for a signature in the file's header - ideally within the first few bytes of the file. Having to search through 64 KB of the file's end for a signature is a major departure from that pattern.