It would be interesting to extend it to zip, which is what redbean/greenbean use to serve static assets.

Back in school, I worked on a project called Velox with a partner. The idea was to take a bz2-compressed dump of the giant XML export of Wikipedia and write a program to serve that copy of Wikipedia from disk (this was in 2008-2010? in my master's program, so before Kiwix and the amazing zim dumps they produce). My partner worked on the UI and indexing, and I focused on parsing the bz2 compression format to locate article boundaries in the (giant) XML dump that Wikipedia provides. I ended up putting a lot of time into it because it was a bunch of fun.

Writing this just sent me back to the presentation I made. The slide I wrote back then said:

> Significant original work went into creation of archive access. The Apache BZip2 library that is part of Ant was used as a basis for archive access.
> Modified to support random access to a given byte/bit offset pair within the compressed data stream (BZip2 is not a byte-aligned format)
> Extended to index all BZip2 block positions, allowing Java-based pseudo-random access to BZip2 compressed data
> Extended to map article IDs to block numbers for constant-time article retrieval, even in BZip2 archives exceeding 5GB in size
> Current article retrieval times are ~2 seconds.

This is back when the archive was ~7GB IIRC. My Kiwix dumps today are ~120GB, but that includes images.

This is the link to the presentation in Google Slides that we wrote back in 2008 or so. The version history shows 2013, but I think some kind of import/conversion happened around that time. https://docs.google.com/presentation/d/e/2PACX-1vTfrxEqvHbd0...

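For anyone curious what the block indexing on that slide looks like in practice, here is a rough Python sketch. To be clear, this is not the Velox code (that was Java, built on the Ant BZip2 library); the function names `find_markers`, `extract_bits` and `decompress_block` are made up for illustration. It assumes a single-stream .bz2 file held in memory, scans for the 48-bit block marker at every bit offset, and decompresses one block on its own by re-wrapping it in a fresh stream header and trailer.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit marker that starts every bzip2 block
EOS_MAGIC = 0x177245385090    # 48-bit marker that ends the stream
MASK48 = (1 << 48) - 1


def find_markers(data):
    """Scan bit by bit; return (block_starts, end_markers) as bit offsets.

    A marker can in principle appear by chance inside compressed data,
    so a real index should validate each hit (e.g. by decoding the block).
    """
    blocks, ends = [], []
    window = 0
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):  # bzip2 packs bits MSB first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            bits_seen += 1
            if bits_seen < 48:
                continue
            if window == BLOCK_MAGIC:
                blocks.append(bits_seen - 48)
            elif window == EOS_MAGIC:
                ends.append(bits_seen - 48)
    return blocks, ends


def extract_bits(data, start, end):
    """Return bits [start, end) of data as (integer value, bit length)."""
    first, last = start // 8, (end + 7) // 8
    chunk = int.from_bytes(data[first:last], "big")
    trailing = last * 8 - end  # unwanted bits after `end` in the last byte
    length = end - start
    return (chunk >> trailing) & ((1 << length) - 1), length


def decompress_block(data, start, end):
    """Decompress the single block occupying bits [start, end) of data."""
    level = data[3:4]  # block-size digit from the original "BZh1".."BZh9"
    block, length = extract_bits(data, start, end)
    # The block's own CRC is the 32 bits right after its 48-bit marker.
    # For a one-block stream the combined stream CRC equals the block CRC.
    block_crc = (block >> (length - 80)) & 0xFFFFFFFF
    stream = int.from_bytes(b"BZh" + level, "big")
    stream = (stream << length) | block
    stream = (stream << 48) | EOS_MAGIC
    stream = (stream << 32) | block_crc
    nbits = 32 + length + 48 + 32
    pad = -nbits % 8  # zero-pad to a whole number of bytes
    return bz2.decompress((stream << pad).to_bytes((nbits + pad) // 8, "big"))
```

With the list of block starts saved as an index, fetching block `i` only needs the bytes between `blocks[i]` and the next marker, which is the "pseudo-random access" idea. Mapping article IDs to block numbers would sit on top of this and is not shown. The pure-Python bit loop is slow and only meant to show the technique.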
The more interesting trick you can do with zip files for HTTP serving is to serve the compressed deflate stream as gzip (i.e. with Content-Encoding: gzip), or use Zstd inside zip. Then you have a valid zip file from which the compressed bytes can be served directly, without recompressing anything.
I have some code which does this at https://git.sr.ht/~dgl/deserve/
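As a rough illustration of the deflate-as-gzip idea (this is my own minimal Python sketch, not how deserve does it, and `gzip_member_from_zip` is a made-up name): a zip entry stored with method 8 holds a raw DEFLATE stream, and gzip is just a 10-byte header, the same raw DEFLATE stream, and a trailer with the CRC-32 and uncompressed size. Both trailer values are already in the zip central directory, so nothing has to be decompressed. The sketch assumes a seekable file object and an unencrypted, deflate-compressed entry.

```python
import struct
import zipfile

# Fixed gzip header: magic, CM=8 (deflate), no flags, mtime=0, XFL=0, OS=255 (unknown)
GZIP_HEADER = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"


def gzip_member_from_zip(fileobj, name):
    """Return a gzip-encoded body for `name` without recompressing it."""
    info = zipfile.ZipFile(fileobj).getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError("entry is not deflate-compressed")
    if info.flag_bits & 0x1:
        raise ValueError("entry is encrypted")

    # The local file header is 30 fixed bytes followed by a variable-length
    # file name and extra field; their lengths sit at offsets 26 and 28.
    fileobj.seek(info.header_offset)
    local_header = fileobj.read(30)
    name_len, extra_len = struct.unpack("<HH", local_header[26:30])
    fileobj.seek(info.header_offset + 30 + name_len + extra_len)
    deflate_stream = fileobj.read(info.compress_size)

    # gzip trailer: CRC-32 and uncompressed size modulo 2**32, little-endian.
    trailer = struct.pack("<II", info.CRC, info.file_size & 0xFFFFFFFF)
    return GZIP_HEADER + deflate_stream + trailer
```

A server would send the result with `Content-Encoding: gzip` to clients that accept it, and fall back to decompressing for the ones that don't. In a real server you would probably precompute the data offset per entry and use sendfile on the middle part rather than reading it into memory.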