| The whole ZIM file infrastructure is pretty broken. I've been trying to put together a system for generating a WARC file by rendering all the wikitext content in a database dump, which is a much more reasonable approach. Rendering wikitext is challenging though, since wikitext can include chunks of other wikitext and can use some pretty complicated templating functionality. Oddly enough, where I've run into the biggest issues is in weird slowdowns of the Python warcio library that make dealing with large archives just about impossible. I haven't had time to really track that down, but if anyone wants to, it's pretty easy to reproduce: just try adding a few million lorem-ipsum articles and look at how far from linear time it's running. There are a lot of advantages to starting from a dump: you can provide much better tools for filtering articles, and probably even provide rudimentary document classification. You can also do things like re-compress and minify images; a dump intended for a cellphone probably doesn't need 4K images. WARC is also probably a better tool for distributing web-archive type content, like Wikipedia dumps. You can distribute a package of text content and image content as separate files, for example. Generally I have not been very impressed with the quality of ZIM file tooling. One disadvantage of WARC is that you need to provide separate search indexing, but that's doable. I'd love to be able to get a Wikimedia grant to work on this and take on less contract work, but so far their grant process is pretty hard to follow. |
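For anyone who wants to try the reproduction described above, here is a rough timing harness, offered as a sketch and not as the original poster's script. It times writes in fixed-size batches so that any drift away from linear time shows up as later batches taking longer than earlier ones. The names `time_batches` and `slowdown_ratio`, the batch sizes, and the in-memory baseline writer are all assumptions made for illustration.

```python
import io
import time

# Filler article body, roughly 1 KB of lorem ipsum.
LOREM = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 20


def time_batches(write_one, total=20_000, batch=5_000):
    """Call write_one(i, body) `total` times; return seconds spent per full batch.

    Any trailing partial batch is written but not timed.
    """
    timings = []
    start = time.perf_counter()
    for i in range(1, total + 1):
        write_one(i, LOREM)
        if i % batch == 0:
            now = time.perf_counter()
            timings.append(now - start)
            start = now
    return timings


def slowdown_ratio(timings):
    """Last batch time divided by first batch time; near 1.0 means linear scaling."""
    return timings[-1] / timings[0]


if __name__ == "__main__":
    # Baseline writer: append to an in-memory buffer. This is a hypothetical
    # stand-in; swap in the real archive writer to test it.
    buf = io.BytesIO()
    timings = time_batches(lambda i, body: buf.write(body))
    print([round(t, 4) for t in timings], slowdown_ratio(timings))
```

To point this at warcio, `write_one` could wrap a `WARCWriter`, building each record with `create_warc_record` and passing it to `write_record`. Whether that is the code path where the slowdown appears is an assumption; the comment does not say.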
In general, I'd say that ZIM and WARC are not really direct competitors or solutions to the same problems; they're for distinct use cases. ZIM is a highly compressed format that's designed solely for static articles and flat content. It doesn't really store headers or anything else that WARC does in order to support full request/response replaying. ZIM is optimized for storing thousands to millions of pages of homogeneous content, while WARC is optimized for high-fidelity collections of smaller amounts of content.
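To make the header point concrete, here is a minimal sketch of what a single WARC `response` record carries, written with only the standard library. It is illustrative and not a replacement for a real WARC library: the helper name `build_warc_response` and the example URL are assumptions, the record is uncompressed, and optional fields such as digests are left out.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response(url: str, http_response: bytes) -> bytes:
    """Serialize one uncompressed WARC/1.0 'response' record."""
    warc_headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", url),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    head = "WARC/1.0\r\n"
    head += "".join(f"{name}: {value}\r\n" for name, value in warc_headers)
    head += "\r\n"  # blank line separates WARC headers from the record block
    # The block is the raw HTTP response; a record ends with two CRLFs.
    return head.encode("utf-8") + http_response + b"\r\n\r\n"


# The captured HTTP response (status line, headers, body) is stored verbatim,
# which is what makes request/response replay possible.
body = b"<html><body>Example article</body></html>"
http_response = b"".join([
    b"HTTP/1.1 200 OK\r\n",
    b"Content-Type: text/html; charset=utf-8\r\n",
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n",
    b"\r\n",
    body,
])

record = build_warc_response("https://example.org/wiki/Example", http_response)

if __name__ == "__main__":
    print(record.decode("utf-8"))
```

Every record repeats this envelope plus the original HTTP status line and headers. That per-record overhead is the cost of replay fidelity, and it is the part ZIM leaves out.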
If you want to help out with our efforts, feel free to DM me on Twitter @theSquashSH or reply here and I can introduce you to the ZIM people (who get grants to improve this process on the regular, and are open to hiring contract workers).