by Smerity
918 days ago
Almost a decade ago I was the lone engineer at Common Crawl. Common Crawl relies heavily on the WARC format. My favorite capability of the WARC format comes from the fact that most compression formats can be written in a way that allows random access. Compression formats such as `gzip` and `zstandard` allow multiple compressed streams to be concatenated and, on decompression, treated as one contiguous file. Hence you can compress inputs separately and literally stick the results together:
    $ echo cat > cat.txt
    $ echo dog > dog.txt
    $ zstd cat.txt dog.txt
    $ cat cat.txt.zst dog.txt.zst > catdog.zst
    $ zstdcat catdog.zst
    cat
    dog
For files in a textual or otherwise clearly delimited format, that means you can fairly trivially jump to any record's offset, provided each record is compressed individually. You lose some amount of compression, but random lookup seems a fairly reasonable tradeoff.
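A minimal sketch of that offset-based lookup, using Python's standard `gzip` module instead of zstd so it runs without extra packages. The helper names `build_archive` and `read_record` and the in-memory index of `(offset, length)` pairs are illustrative assumptions, not how Common Crawl stores its index.

```python
import gzip


def build_archive(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the archive bytes plus an index of (offset, length) pairs,
    one per record, pointing at that record's compressed member.
    """
    archive = bytearray()
    index = []
    for record in records:
        member = gzip.compress(record)
        index.append((len(archive), len(member)))
        archive += member
    return bytes(archive), index


def read_record(archive, offset, length):
    """Decompress only the member stored at archive[offset:offset + length]."""
    return gzip.decompress(archive[offset:offset + length])


records = [b"cat\n", b"dog\n", b"bird\n"]
archive, index = build_archive(records)

# Decompressing the whole archive yields every record in order,
# just like running zstdcat on the concatenated file.
assert gzip.decompress(archive) == b"cat\ndog\nbird\n"

# Jump straight to the second record without decompressing the first.
offset, length = index[1]
print(read_record(archive, offset, length))  # b'dog\n'
```

The same `(offset, length)` pair is all a remote reader would need to fetch one record with a byte-range read instead of downloading the whole file.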
Common Crawl used this, together with Amazon S3's support for HTTP Range requests[1], to allow entirely random lookups into web crawl datasets dozens or hundreds of terabytes in size without any change in file format.
Trading compression for random lookup is even more forgiving if you create a separate compression dictionary tailored to your dataset. For web crawling that would likely get you the majority of the compression gains back, unless pages from the same website are written sequentially, which is unlikely in most situations. A website's shared templates would give very high compression gains across files, which you'd lose by allowing random lookup, but most crawlers don't operate sequentially, so those local compression gains are more likely to be small than large.
[1]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...
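To illustrate the dictionary idea, here is a rough sketch that uses the preset-dictionary support in Python's standard `zlib` module as a stand-in for a trained zstd dictionary. The `DICTIONARY` bytes and the sample `page` are made up for the example; a real dictionary would be built from a sample of the dataset. Note that this produces zlib-format streams, not gzip or zstd ones.

```python
import zlib

# Hypothetical shared boilerplate standing in for a dataset-tailored dictionary.
DICTIONARY = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title></title></head>'
    b'<body><div class="content"><p></p></div></body></html>'
)


def compress_record(record, dictionary=None):
    """Compress one record on its own, optionally primed with a dictionary."""
    if dictionary is None:
        compressor = zlib.compressobj()
    else:
        compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(record) + compressor.flush()


def decompress_record(blob, dictionary=None):
    """Decompress one record; the same dictionary must be supplied if one was used."""
    if dictionary is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# A made-up page that shares most of its markup with the dictionary.
page = (
    b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>Cats</title></head>'
    b'<body><div class="content"><p>cat</p></div></body></html>'
)

plain = compress_record(page)
primed = compress_record(page, DICTIONARY)

# The record still round-trips independently of any other record.
assert decompress_record(primed, DICTIONARY) == page

# Compare sizes: original, compressed alone, compressed with the dictionary.
print(len(page), len(plain), len(primed))
```

Each record stays individually decompressible, so random lookup is preserved. The cost is that every reader needs a copy of the dictionary alongside the data.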