|
It's a pretty simple design, and it's based on the ARC format (https://archive.org/web/researcher/ArcFileFormat.php), which is even simpler. In response to your questions, here's my take (as somebody who used to work on web archiving). 1. Two reasons. First, many small files are harder to manage than a few large ones. A single WARC file might contain hundreds or thousands of captured files, and it's easier to manage big bundles of files that are roughly the same size, both for humans and, at least in the past, for the file systems themselves. Second, once you break them up into individual files, what do you name them? If you give them names unrelated to the URL that was fetched, what is the advantage? If you name them based on the URL, you suddenly have the problem of mapping a URL to a legal file name, and the rules for that vary by file system. This would be a huge headache. 2. Yes, it predates SQLite (at least the ARC format it's based on does), but also, why would you use SQLite? That's adding a huge amount of complexity. Is SQLite even good at storing big binary blobs? Additionally, WARC files are gzipped in a clever way: each record is compressed individually, which allows random access to a single record in a compressed file without needing to read the entire WARC file. |
Nope! SQLite is good for lots of small-ish blobs (kilobytes), but less so once you start getting into the megabyte range. There's also currently a hard upper limit on blob size of 2 GiB.
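
To make the per-record gzip point above concrete, here's a minimal Python sketch of the idea. It is not a WARC reader or writer: the payloads are placeholder byte strings rather than real WARC records, and the `write_records` / `read_record` helpers and the in-memory offset index are hypothetical names made up for illustration. It only relies on the fact that concatenated gzip members form a valid gzip stream, so you can seek to one member and decompress just that.

```python
import gzip
import io


def write_records(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined bytes plus an index of (offset, compressed_length)
    pairs, one per record. The index layout is an assumption for this sketch.
    """
    buf = io.BytesIO()
    index = []
    for payload in records:
        member = gzip.compress(payload)  # one complete gzip member per record
        index.append((buf.tell(), len(member)))
        buf.write(member)
    return buf.getvalue(), index


def read_record(fileobj, offset, length):
    """Seek straight to one member and decompress only that member."""
    fileobj.seek(offset)
    return gzip.decompress(fileobj.read(length))


if __name__ == "__main__":
    # Placeholder payloads standing in for archived records.
    blob, index = write_records([b"alpha", b"beta", b"gamma"])
    archive = io.BytesIO(blob)  # stands in for a file opened in binary mode

    # Random access: fetch the second record without touching the others.
    print(read_record(archive, *index[1]))

    # The concatenation is still a valid multi-member gzip stream.
    print(gzip.decompress(blob))
```

In practice the offsets would live in some external index keyed by URL, so a lookup is one seek plus one small decompress instead of a scan through the whole archive.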