by peterburkimsher 931 days ago
Kiwix is good, but the file size of offline Wikipedia is bloated compared to Wiki2Touch.

Kiwix is over 100 GB for the English Wikipedia. https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/

wikipedia_en_all_maxi_2023-10.zim 31-Oct-2023 07:37 103214717026

http://www.haukap.com/wiki2touch/

Wiki2Touch is about 9 GB for en-wiki as of 2012, and about 14 GB now.

I've been working on a Wiki2Touch reader for more modern iPhones. I've got the BZ2 decompression working; now I'm just cleaning up the markup-to-HTML parser in JavaScript.

If you're interested, please email me and I'll send you the code! You can install it for a week on a non-jailbroken device using Xcode self-signing, and if you're jailbroken you can install the Immortal tweak to make it work forever.

If you have a developer subscription and want to put it on the App Store, I'd be very happy to let you do that. I just don't want to pay $100 for the privilege of running my own code.


What accounts for the large difference in size?
I think Kiwix includes images.
Even the non-image English file is about 60GB, so I’m really curious what Wiki2Touch does differently!

What date is that 14GB version from? And is it really not filtered at all?

The 14GB is from 2021-03-01. It's not filtered, it just stores the wikitext and an index of the article names, compressed with BZ2.
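
To make "wikitext plus an index of article names, compressed with BZ2" concrete, here is a rough Python sketch of that general layout. It is not the actual Wiki2Touch format, which isn't described in this thread; `build_archive`, `read_article`, and the `block_size` parameter are hypothetical names chosen for illustration.

```python
import bz2


def build_archive(articles, block_size=4):
    """Pack {title: wikitext} into bz2 blocks plus a title index.

    Hypothetical layout: several articles share one compressed block, and
    the index maps each title to (block number, byte offset, byte length).
    """
    blocks = []
    index = {}
    items = sorted(articles.items())
    for start in range(0, len(items), block_size):
        raw = bytearray()
        for title, text in items[start:start + block_size]:
            data = text.encode("utf-8")
            index[title] = (len(blocks), len(raw), len(data))
            raw += data
        blocks.append(bz2.compress(bytes(raw)))
    return blocks, index


def read_article(blocks, index, title):
    """Decompress only the block that holds the requested article."""
    block_no, offset, length = index[title]
    raw = bz2.decompress(blocks[block_no])
    return raw[offset:offset + length].decode("utf-8")
```

Looking up a title costs one block decompression, so larger blocks should compress better but make each read slower.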

The XML, which is larger, is now 21 GB. How did you get 60GB for the non-image English file?

https://dumps.wikimedia.org/enwiki/latest/

enwiki-latest-pages-articles.xml.bz2 02-Dec-2023 02:38 21557219519
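
For anyone comparing the listings, a quick Python sketch that converts the byte counts quoted in this thread into GB and GiB. The byte counts are copied from the directory listings above; the helper names `to_gb`, `to_gib`, and `ratio` are made up here.

```python
# Byte counts as quoted from the directory listings in this thread.
SIZES = {
    "wikipedia_en_all_maxi_2023-10.zim": 103214717026,
    "enwiki-latest-pages-articles.xml.bz2": 21557219519,
}


def to_gb(n_bytes):
    """Decimal gigabytes (10**9 bytes)."""
    return n_bytes / 10**9


def to_gib(n_bytes):
    """Binary gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


def ratio(a, b):
    """How many times larger file a is than file b."""
    return SIZES[a] / SIZES[b]


if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name}: {to_gb(size):.1f} GB ({to_gib(size):.1f} GiB)")
```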

I meant the Kiwix dump (https://download.kiwix.org/zim/wikipedia_en_all_nopic.zim – careful, 60GB!).

At first glance, the Wikimedia XML dump does not look substantially different from what Kiwix/ZIM does with compressed HTML: they're both compressed (bz2 for the Wikimedia dump, zstd or LZMA for Kiwix/ZIM), and both compress multiple files at once, so inter-file redundancy should hopefully be significantly reduced.

HTML seems a bit more verbose than the Mediawiki syntax (plus the XML header for each article), but I'd be surprised if that actually accounted for a 3x difference in size.

Then again, Kiwix seems to have experimented with shared dictionary brotli compression, which supposedly yields a >2x improvement: https://github.com/openzim/libzim/issues/144

I wonder if their current zstd implementation also uses shared dictionaries. If not, that might just be the reason: If ZIM compression chunks are much smaller than the bz2 streams of the Wikimedia dumps, there would still be a lot of redundancy between chunks.
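
A toy way to see the effect I'm speculating about, sketched in Python. This uses zlib's preset dictionary (`zdict`) from the standard library as a stand-in for zstd or brotli shared dictionaries, and the sample pages and `BOILERPLATE` string are invented, so it only illustrates the principle and says nothing about real ZIM files.

```python
import zlib

# Invented markup that every sample page repeats (an assumption for the demo).
BOILERPLATE = (
    b'<html><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="style.css"></head>'
    b'<body><div class="mw-parser-output">'
)


def compress_chunk(chunk, zdict=b""):
    """Compress one chunk on its own, optionally primed with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(chunk) + comp.flush()


def decompress_chunk(blob, zdict=b""):
    """Inverse of compress_chunk; needs the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def make_chunks(count=50):
    """Build small HTML-like pages that share the same boilerplate."""
    return [
        BOILERPLATE
        + b"<h1>Article %d</h1><p>Body text number %d.</p></div></body></html>" % (i, i)
        for i in range(count)
    ]


def compare(chunks):
    """Total compressed size for three strategies, in bytes."""
    independent = sum(len(compress_chunk(c)) for c in chunks)
    shared = sum(len(compress_chunk(c, BOILERPLATE)) for c in chunks)
    single = len(compress_chunk(b"".join(chunks)))
    return independent, shared, single


if __name__ == "__main__":
    print("independent=%d shared_dict=%d single_stream=%d" % compare(make_chunks()))
```

Small independent chunks pay for the shared boilerplate every time, while a shared dictionary or one long stream only pays for it once. Whether that explains the ZIM numbers depends on the actual cluster sizes, which I haven't checked.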