by ghukill 874 days ago
If interested in WARC, recommend also checking out WACZ: https://specs.webrecorder.net/wacz/1.1.1/
1 comments

What's the point of WACZ? It appears to wrap a number of WARC files into a single zip, enabling Range requests to specific WARC files so it can be served by a passive file server. But why is that needed?
It's huge for being able to replay big WARC files in a browser without having to download the whole thing. (E.g. try loading a 700 MB WARC from IPFS to visit one page within it; it's too slow to work as-is.)

It's used extensively by the Browsertrix/Webrecorder.io projects (whose team pioneered the WACZ format) and a few other projects.

Oh, I may have missed that part. So the WACZ (indexes?) can contain offsets into the WARC file itself for each individual page?
WACZ is a replacement for WARC that has the index with offsets built in.
But it uses WARC files inside as the archive format. It seems weird to call it a replacement when the original is still present.
I just meant from a user's perspective it's a format that supersedes WARC. But internally, yes, one is an encapsulation format for the other.
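
For anyone wondering how the "offsets into a zip" idea works mechanically, here is a rough Python sketch. It is not the WACZ reference implementation, and it assumes the WARC member is stored in the zip uncompressed (which is my understanding of what the spec recommends). The names `build_demo_zip`, `member_data_offset`, `range_read`, and `fetch_record` are made up for illustration, the "records" are plain byte strings rather than real WARC records, and the dict index stands in for the real index files. A byte slice plays the role of an HTTP Range request.

```python
import io
import struct
import zipfile

MEMBER = "archive/data.warc"  # hypothetical member name for this demo


def build_demo_zip(records):
    """Concatenate payloads into one fake WARC and wrap it in an uncompressed zip.

    Returns (zip_bytes, index), where index maps key -> (offset, length)
    relative to the start of the WARC member.
    """
    warc = b""
    index = {}
    for key, payload in records.items():
        index[key] = (len(warc), len(payload))
        warc += payload
    buf = io.BytesIO()
    # ZIP_STORED means no compression, so member bytes appear verbatim in the zip.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(MEMBER, warc)
    return buf.getvalue(), index


def member_data_offset(zip_bytes, name):
    """Find where a stored member's data begins inside the zip file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        info = zf.getinfo(name)
    # The local file header is 30 fixed bytes; the name and extra-field
    # lengths are little-endian uint16 values at offsets 26 and 28.
    name_len, extra_len = struct.unpack_from("<HH", zip_bytes, info.header_offset + 26)
    return info.header_offset + 30 + name_len + extra_len


def range_read(blob, start, length):
    """Stand-in for an HTTP request with 'Range: bytes=start-(start+length-1)'."""
    return blob[start:start + length]


def fetch_record(zip_bytes, index, key):
    """Read one record without touching the rest of the archive data."""
    offset, length = index[key]
    base = member_data_offset(zip_bytes, MEMBER)
    return range_read(zip_bytes, base + offset, length)


if __name__ == "__main__":
    demo = {
        "https://example.com/": b"<record for the home page>",
        "https://example.com/about": b"<record for the about page>",
    }
    blob, idx = build_demo_zip(demo)
    print(fetch_record(blob, idx, "https://example.com/about"))
```

In a real client the zip's central directory and the index would themselves be fetched with Range requests near the end of the file, so a passive file server is enough and only a few small reads are needed per page.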