Hacker News new | ask | show | jobs
by electroly 918 days ago
Isn't this a benefit you'd trivially get just by using .zip? I pull individual files out of large .zip archives in S3 using HTTP range requests; works exactly as you'd expect. You know the zip header is at the end of the file, and the header tells you the offset and length of the compressed entry data so you can request the range. Two requests if you've never seen the .zip before, one if you've got the zip header cached.
2 comments

As mentioned it's trivial across the spread of compression algorithms supporting this type of behaviour (`gzip`, `zstandard`, `zip`, ...), the header in `zip` making it even more convenient as you note!

WARC as a format essentially states that unless you have good reason "record at a time" compression is the preferred[1]. The mixture of "technically possible" and "part of spec" is what makes it so useful - any generic WARC tool can support random access, there are explicit fields to index over (URL), and even non-conforming WARC files can be easily rewritten to add such a capability.

[1]: https://iipc.github.io/warc-specifications/specifications/wa...

It occurs to me that you could stick a few bytes of header in the beginning of the ZIP file, to tell you the exact location of the header at the end of it, thus avoiding multiple lookups. It would even still be ZIP-compatible.
Definitely. I take an alternative but similar approach: since I control the zip files, I can guarantee that the header is always within the last N kilobytes of the zip file (configurable value of N). I spend a HEAD request to get the length of the zip file and then walk backwards by N kilobytes. You would request the few bytes at the beginning instead of using that request to get the file length.
How can you guarantee that it's always within N kilobytes, when N depends on the number of files in the zip?
If you're creating the zips in the first place, you can just check and see how big the headers are when you create them. If you happen to get N wrong, you can request another chunk, but obviously it's nice to avoid multiple requests to get the header. For my use case, the number of files is small and relatively consistent between zips so a generous value of 64KB ended up working great.