Hacker News new | ask | show | jobs
by Smerity 918 days ago
In the distant past I was the lone engineer of Common Crawl almost a decade ago. Common Crawl heavily leverages the WARC format.

My favorite capability of the WARC format borrows from the fact that most compression formats can be written to allow random access. Compression formats such as `gzip` and `zstandard` allow multiple compressed streams to be stuck together and act during decompression as if it's one contiguous file.

Hence you can create multiple compressions and literally stick them together:

  $ echo cat > cat.txt
  $ echo dog > dog.txt
  $ zstd cat.txt dog.txt
  $ cat cat.txt.zst dog.txt.zst > catdog.zst
  $ zstdcat catdog.zst
  cat
  dog
For files composed of only a textual / clearly delimited format that means you can fairly trivially leap to a different offset assuming each of the inputs is compressed individually. You lose out on some amount of compression but random lookup seems a fairly reasonable tradeoff. Common Crawl was able to use this to allow entirely random lookups into web crawl datasets dozens / hundreds of terabytes in size without any change in file format for example and utilizing Amazon S3's support for HTTP Range requests[1].

Trading compression for random lookup is even more forgiving if you create a separate compression dictionary tailored toward your dataset. For web crawling you'd likely get you the majority of the compression gains back unless pages from the same website are sequentially written which is unlikely in most situations. The website's shared template/s would result in very high compression gains across files which you'd lose by allowing random lookup but most crawlers don't don't operate sequentially so local compression gains are likely smaller than larger.

[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...

1 comments

Isn't this a benefit you'd trivially get just by using .zip? I pull individual files out of large .zip archives in S3 using HTTP range requests; works exactly as you'd expect. You know the zip header is at the end of the file, and the header tells you the offset and length of the compressed entry data so you can request the range. Two requests if you've never seen the .zip before, one if you've got the zip header cached.
As mentioned it's trivial across the spread of compression algorithms supporting this type of behaviour (`gzip`, `zstandard`, `zip`, ...), the header in `zip` making it even more convenient as you note!

WARC as a format essentially states that unless you have good reason "record at a time" compression is the preferred[1]. The mixture of "technically possible" and "part of spec" is what makes it so useful - any generic WARC tool can support random access, there are explicit fields to index over (URL), and even non-conforming WARC files can be easily rewritten to add such a capability.

[1]: https://iipc.github.io/warc-specifications/specifications/wa...

It occurs to me that you could stick a few bytes of header in the beginning of the ZIP file, to tell you the exact location of the header at the end of it, thus avoiding multiple lookups. It would even still be ZIP-compatible.
Definitely. I take an alternative but similar approach: since I control the zip files, I can guarantee that the header is always within the last N kilobytes of the zip file (configurable value of N). I spend a HEAD request to get the length of the zip file and then walk backwards by N kilobytes. You would request the few bytes at the beginning instead of using that request to get the file length.
How can you guarantee that it's always within N kilobytes, when N depends on the number of files in the zip?
If you're creating the zips in the first place, you can just check and see how big the headers are when you create them. If you happen to get N wrong, you can request another chunk, but obviously it's nice to avoid multiple requests to get the header. For my use case, the number of files is small and relatively consistent between zips so a generous value of 64KB ended up working great.