| HN Mirror


by rpdillon 8 days ago

It would be interesting to extend it to zip, which is what redbean/greenbean use to serve static assets.

Back in school, I worked on a project called Velox with a partner. The idea was to take a bz2-compressed dump of the giant XML export of Wikipedia and write a program to serve that copy of Wikipedia from disk (this was around 2008-2010, I think, during my master's program, so before Kiwix and the amazing zim dumps they produce). My partner worked on the UI and indexing, and I focused on parsing the bz2 compression format to locate article boundaries in the (giant) XML dump that Wikipedia provides. I ended up putting a lot of time into it because it was a bunch of fun.

Writing this just sent me back to the presentation I made. The slide I wrote back then said:

> Significant original work went into creation of archive access. The Apache BZip2 library that is part of Ant was used as a basis for archive access.

> Modified to support random access to a given byte/bit offset pair within the compressed data stream (BZip2 is not a byte-aligned format)

> Extended to index all BZip2 block positions, allowing Java-based pseudo-random access to BZip2 compressed data

> Extended to map article IDs to block numbers for constant-time article retrieval, even in BZip2 archives exceeding 5GB in size

> Current article retrieval times are ~2 seconds.

This was back when the archive was ~7GB IIRC. My Kiwix dumps today are ~120GB, but that includes images.
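To illustrate the block-indexing idea from the slide: every bzip2 block starts with the 48-bit marker 0x314159265359, and because blocks are not byte-aligned the scan has to check every bit offset. The Python sketch below is not the Velox code (that was Java, built on the Apache BZip2 library); the function name `find_block_bit_offsets` is made up for illustration, and a real indexer would also need to handle the rare chance of the marker appearing inside compressed data.

```python
BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of each bzip2 block
MASK48 = (1 << 48) - 1


def find_block_bit_offsets(data: bytes) -> list:
    """Return the bit offset of every candidate block marker in `data`.

    bzip2 packs bits most-significant-bit first, so we shift each bit
    into a 48-bit sliding window and compare it against the marker.
    """
    offsets = []
    window = 0
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                # The marker began 48 bits before the current position.
                offsets.append(bits_seen - 48)
    return offsets
```

Each result splits into a (byte, bit) pair with `divmod(offset, 8)`, which is the kind of offset pair the slide mentions. A table from article ID to position in this list would then give the article-to-block lookup.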

This is the link to the presentation in Google Slides that we wrote back in 2008 or so. The version history shows 2013, but I think some kind of import/conversion happened around that time.

https://docs.google.com/presentation/d/e/2PACX-1vTfrxEqvHbd0...


dgl 8 days ago

Zip isn't useful for random access here; the problem with random access in HTTP serving is that you then have to decompress the data and potentially recompress it.

The more interesting trick you can do with zip files for HTTP serving is to serve the compressed deflate stream as gzip, or use Zstd inside zip. Then you have a valid zip file from which bytes can be served directly.
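A rough Python sketch of the deflate-as-gzip trick, for illustration only (it is not taken from the linked code, and `zip_member_as_gzip` is a hypothetical helper name): a gzip member is a 10-byte header, a raw deflate stream, and a trailer holding the CRC-32 and uncompressed size, and the zip directory already stores both of those values.

```python
import struct
import zipfile


def zip_member_as_gzip(fileobj, name: str) -> bytes:
    """Wrap a deflate-compressed zip member in gzip framing without recompressing.

    `fileobj` must be a seekable binary file object containing the zip archive.
    """
    with zipfile.ZipFile(fileobj) as zf:
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_DEFLATED:
        raise ValueError("member is not deflate-compressed")

    # The local file header is 30 fixed bytes followed by the file name and
    # extra field; their lengths sit at byte offsets 26 and 28.
    fileobj.seek(info.header_offset)
    local_header = fileobj.read(30)
    name_len, extra_len = struct.unpack("<HH", local_header[26:30])
    fileobj.seek(info.header_offset + 30 + name_len + extra_len)
    raw_deflate = fileobj.read(info.compress_size)

    # Minimal gzip header: magic, CM=8 (deflate), no flags, mtime=0,
    # XFL=0, OS=255 (unknown).
    header = b"\x1f\x8b\x08\x00" + b"\x00\x00\x00\x00" + b"\x00\xff"
    # gzip trailer: CRC-32 and size modulo 2**32, both little-endian.
    trailer = struct.pack("<II", info.CRC, info.file_size & 0xFFFFFFFF)
    return header + raw_deflate + trailer
```

A server could send the result with `Content-Encoding: gzip`; since the header and trailer are tiny, the bulk of the response is the bytes already stored in the zip file.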

I have some code which does this at https://git.sr.ht/~dgl/deserve/
