by electroly
918 days ago
Isn't this a benefit you'd trivially get just by using .zip? I pull individual files out of large .zip archives in S3 using HTTP range requests; it works exactly as you'd expect. The zip central directory sits at the end of the file, and it gives you the offset and compressed length of each entry, so you know which range to request. That's two requests if you've never seen the .zip before, one if you've got the central directory cached.
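To make the two-request flow concrete, here is a minimal sketch in Python. The `fetch(start, length)` callable is a stand-in for an HTTP range request (for S3 it would issue a GET with a `Range` header); it, `read_zip_entry`, and the `size` argument are names made up for illustration. The sketch assumes a plain archive: no zip64, no encryption, UTF-8 or ASCII entry names, and only the stored and deflate methods.

```python
import struct
import zlib


def read_zip_entry(fetch, size, name):
    """Return the uncompressed bytes of `name` from a remote zip.

    fetch(start, length) -> bytes is a hypothetical stand-in for an HTTP
    range request; size is the total archive length in bytes.
    """
    # Request 1: the tail of the archive. The end-of-central-directory
    # record is 22 bytes plus a comment of at most 65535 bytes.
    tail_len = min(size, 22 + 65535)
    tail_start = size - tail_len
    tail = fetch(tail_start, tail_len)
    eocd = tail.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("end-of-central-directory record not found")
    cd_size, cd_offset = struct.unpack_from("<II", tail, eocd + 12)

    # The central directory is often already inside the tail we fetched;
    # for a very large directory an extra request is needed.
    if cd_offset >= tail_start:
        rel = cd_offset - tail_start
        cd = tail[rel:rel + cd_size]
    else:
        cd = fetch(cd_offset, cd_size)

    pos = 0
    while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        (method,) = struct.unpack_from("<H", cd, pos + 10)
        (csize,) = struct.unpack_from("<I", cd, pos + 20)
        nlen, elen, clen = struct.unpack_from("<HHH", cd, pos + 28)
        (header_offset,) = struct.unpack_from("<I", cd, pos + 42)
        entry_name = cd[pos + 46:pos + 46 + nlen].decode("utf-8")
        if entry_name == name:
            # Request 2: local file header plus compressed data.
            blob = fetch(header_offset, 30 + nlen + elen + csize)
            # The local header has its own name/extra lengths, which may
            # differ from the central directory's copy.
            local_nlen, local_elen = struct.unpack_from("<HH", blob, 26)
            start = 30 + local_nlen + local_elen
            data = blob[start:start + csize]
            if len(data) < csize:
                # Rare: the local extra field was longer than expected.
                missing = csize - len(data)
                data += fetch(header_offset + start + len(data), missing)
            if method == 0:  # stored
                return data
            if method == 8:  # raw deflate stream, no zlib header
                return zlib.decompressobj(-15).decompress(data)
            raise NotImplementedError(f"compression method {method}")
        pos += 46 + nlen + elen + clen
    raise KeyError(name)
```

If the central directory bytes are cached between calls, the first request can be skipped, which is the one-request case described above.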
WARC as a format essentially states that, unless you have a good reason not to, "record at a time" compression is the preferred approach[1]. The mixture of "technically possible" and "part of the spec" is what makes it so useful: any generic WARC tool can support random access, there are explicit fields to index over (URL), and even non-conforming WARC files can easily be rewritten to add that capability.
[1]: https://iipc.github.io/warc-specifications/specifications/wa...
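As a rough illustration of why "record at a time" compression gives random access, here is a small Python sketch. It is not a WARC reader or writer: the payloads are arbitrary bytes rather than real WARC records, and `write_records`, `read_record`, and the URL-keyed index are hypothetical names invented for the example. The only idea it shows is that each record becomes its own gzip member, so an (offset, length) index lets you decompress one record without touching the rest.

```python
import gzip


def write_records(records):
    """Compress each (key, payload) pair as its own gzip member.

    Returns the concatenated bytes and an index mapping
    key -> (offset, length) of that record's gzip member.
    """
    blob = bytearray()
    index = {}
    for key, payload in records:
        member = gzip.compress(payload)
        index[key] = (len(blob), len(member))
        blob += member
    return bytes(blob), index


def read_record(blob, index, key):
    """Decompress a single record using only its own byte range."""
    offset, length = index[key]
    return gzip.decompress(blob[offset:offset + length])


records = [
    ("https://example.com/a", b"first record body"),
    ("https://example.com/b", b"second record body"),
]
blob, index = write_records(records)
second = read_record(blob, index, "https://example.com/b")
```

Because concatenated gzip members still form a valid gzip stream, the same bytes can also be decompressed front to back by a tool that knows nothing about the index; the slice in `read_record` is the part that would become a range request against remote storage.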