by egh 868 days ago

It's a pretty simple design, and it's based on the ARC format (https://archive.org/web/researcher/ArcFileFormat.php) which is even simpler. In response to your questions, here's my take (as somebody who used to work on web archiving).

1. Two reasons. First, many small files are harder to manage than a few large ones. A single WARC file might contain hundreds or thousands of captured files, and big groups of files that are roughly the same size are easier to handle, both for humans and, at least in the past, for the file systems themselves. Second, once you break them up into individual files, what do you name them? If you give them names unrelated to the URL that was fetched, what is the advantage? If you name them after the URL, you suddenly have the problem of mapping a URL to a legal file name, and what counts as legal varies by file system. This would be a huge headache.

2. Yes, it predates SQLite, but also, why would you use SQLite? That would add a huge amount of complexity. Is SQLite even good at storing big binary blobs?
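
To make the naming problem in point 1 concrete, here is a rough Python sketch. The 255-byte limit and both helper functions are assumptions for illustration only (real limits and forbidden characters differ between file systems), and none of this comes from any web archiving tool.

```python
import hashlib
from urllib.parse import quote

# Assumed per-name limit; common on Linux file systems, but it varies.
MAX_NAME_BYTES = 255


def naive_filename(url: str) -> str:
    """Hypothetical helper: percent-encode everything outside letters, digits and '_.-~'."""
    return quote(url, safe="")


def hashed_filename(url: str) -> str:
    """Hypothetical helper: fixed-length name, but no longer related to the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


long_url = "https://example.com/search?q=" + "a" * 300

encoded = naive_filename(long_url)
# The encoded name has no slashes, but it is longer than the assumed limit.
print(len(encoded.encode("utf-8")) > MAX_NAME_BYTES)

# The hash always fits, but you now need a separate index from URL to name.
print(hashed_filename(long_url))
```

Either way you end up maintaining a URL-to-name index, which is roughly what an index over a WARC file gives you anyway.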

Additionally, WARC files are gzipped in a clever way: each record in the file is compressed individually, which allows random access to the enclosed content of a compressed WARC file without needing to read the entire file.

3 comments

Retr0id 868 days ago

> Is SQLite even good at storing big binary blobs

Nope! SQLite is good for lots of small-ish blobs (kilobytes), but once you start getting into the megabyte range, less so. There's also currently a hard upper limit on blob size of 2 GiB.


abracadaniel 868 days ago

To add to this, WARC.gz files are also concatenated gzip records, so you can read any record by starting a decompression at a known offset. This gives you the access time of an individual file with the efficiency of packing many, many records into one file.
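
A minimal Python sketch of that offset-based read, using only the standard library. The records here are placeholder byte strings rather than real WARC records, and `build_archive` and `read_record` are names made up for this example. In practice the offsets would come from an index built when the archive was written.

```python
import gzip
import io
import zlib


def build_archive(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the archive bytes and the byte offset where each member starts.
    """
    buf = io.BytesIO()
    offsets = []
    for record in records:
        offsets.append(buf.tell())
        buf.write(gzip.compress(record))
    return buf.getvalue(), offsets


def read_record(fileobj, offset, chunk_size=4096):
    """Seek to a member's offset and decompress only that one member."""
    fileobj.seek(offset)
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    parts = []
    while not decomp.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts.append(decomp.decompress(chunk))
    return b"".join(parts)


records = [b"first record", b"second record", b"third record"]
archive, offsets = build_archive(records)

# Read the second record without touching the first.
print(read_record(io.BytesIO(archive), offsets[1]))
```

The concatenation is still a valid gzip file as a whole, so ordinary tools can decompress it front to back as well.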


nikisweeting 868 days ago

WACZ also extends this functionality to allow streaming archives off a server without having to request the whole file to get one page. https://replayweb.page/docs/wacz-format


kburman 868 days ago

Thanks for the insights, egh! It's clear now why SQLite wouldn't be ideal for this purpose. Also, the point about URLs not always being valid filenames really makes sense.
