Hacker News
by lxgr 931 days ago
Even the non-image English file is about 60GB, so I’m really curious what Wiki2Touch does differently!

What date is that 14GB version from? And is it really not filtered at all?

The 14GB is from 2021-03-01. It's not filtered, it just stores the wikitext and an index of the article names, compressed with BZ2.
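To make that layout concrete, here is a rough sketch of "wikitext plus a title index, compressed with bz2". The block size, the NUL separator, the index shape, and the function names `build_archive` and `read_article` are all assumptions for illustration; this is not Wiki2Touch's actual on-disk format.

```python
import bz2

def build_archive(articles, block_size=2):
    """Pack {title: wikitext} into concatenated bz2 blocks plus a title index.

    Hypothetical layout: index[title] = (block_offset, block_length, position_in_block).
    Assumes the wikitext never contains a NUL character.
    """
    data = bytearray()
    index = {}
    items = sorted(articles.items())
    for start in range(0, len(items), block_size):
        group = items[start:start + block_size]
        # Join the articles of one block with a NUL separator, then compress the block.
        raw = "\0".join(text for _, text in group).encode("utf-8")
        packed = bz2.compress(raw)
        for pos, (title, _) in enumerate(group):
            index[title] = (len(data), len(packed), pos)
        data += packed
    return bytes(data), index

def read_article(data, index, title):
    """Decompress only the block that holds the requested article."""
    offset, length, pos = index[title]
    raw = bz2.decompress(data[offset:offset + length])
    return raw.decode("utf-8").split("\0")[pos]

articles = {
    "Alpha": "'''Alpha''' is the first letter of the [[Greek alphabet]].",
    "Beta": "'''Beta''' is the second letter of the [[Greek alphabet]].",
    "Gamma": "'''Gamma''' is the third letter of the [[Greek alphabet]].",
}
archive, title_index = build_archive(articles)
print(read_article(archive, title_index, "Gamma"))
```

Larger blocks would likely compress better but make each lookup decompress more data, so the block size is the main knob in a design like this.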

The XML, which is larger, is now 21 GB. How did you get 60GB for the non-image English file?

https://dumps.wikimedia.org/enwiki/latest/

enwiki-latest-pages-articles.xml.bz2 02-Dec-2023 02:38 21557219519
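As a quick sanity check on the numbers, the byte count in that listing can be converted to gigabytes and compared against the 60GB figure. The variable names are just for illustration, and 60 GB is taken as a round decimal figure.

```python
size_bytes = 21_557_219_519  # byte count from the directory listing above

gb = size_bytes / 10**9    # decimal gigabytes
gib = size_bytes / 2**30   # binary gibibytes
ratio = 60 / gb            # how much larger a 60 GB file is

print(f"{gb:.2f} GB, {gib:.2f} GiB")
print(f"60 GB is about {ratio:.2f}x the XML dump")
```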

I meant the Kiwix dump (https://download.kiwix.org/zim/wikipedia_en_all_nopic.zim – careful, 60GB!).

At first glance, the Wikimedia XML dump does not look substantially different from what Kiwix/ZIM does with compressed HTML: both are compressed (bz2 for the Wikimedia dump, zstd or LZMA for Kiwix/ZIM), and both compress multiple files at once, so inter-file redundancy should hopefully be significantly reduced.
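The effect of compressing multiple files at once can be illustrated with a small experiment. This is only a sketch on made-up data: the article template is invented, and bz2 from the standard library stands in for whichever compressor is actually used.

```python
import bz2

# Synthetic "articles": a shared template plus a small unique part (invented sample data).
template = (
    "{{Infobox settlement |name=%s |country=Exampleland |population=%d}}\n"
    "'''%s''' is a town in [[Exampleland]]. It is part of the [[Example District]].\n"
)
articles = [
    (template % (f"Town{i}", 1000 + i, f"Town{i}")).encode("utf-8")
    for i in range(200)
]

# One compressed stream per article: the shared template is paid for every time.
separate = sum(len(bz2.compress(a)) for a in articles)

# One compressed stream for all articles: the template is shared across them.
together = len(bz2.compress(b"".join(articles)))

print(f"separate: {separate} bytes, together: {together} bytes")
```

Real articles are far less repetitive than this sample, so the gap here overstates what a real dump would show.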

HTML seems a bit more verbose than the MediaWiki syntax (plus the XML header for each article), but I'd be surprised if that actually accounted for a 3x difference in size.

Then again, Kiwix seems to have experimented with shared dictionary brotli compression, which supposedly yields a >2x improvement: https://github.com/openzim/libzim/issues/144

I wonder if their current zstd implementation also uses shared dictionaries. If not, that might just be the reason: If ZIM compression chunks are much smaller than the bz2 streams of the Wikimedia dumps, there would still be a lot of redundancy between chunks.
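To show what a shared dictionary buys when chunks are small, here is a minimal sketch using zlib's preset-dictionary support (`zdict`) from the Python standard library. zlib is a stand-in here, not what ZIM uses, and the HTML boilerplate, chunk contents, and the helper names `pack` and `unpack` are made up for illustration.

```python
import zlib

# Boilerplate that appears in every chunk (invented sample data).
boilerplate = (
    b'<html><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="style.css"></head><body class="mw-body">'
)
chunks = [
    boilerplate + f"<h1>Article {i}</h1><p>Text {i}</p></body></html>".encode("utf-8")
    for i in range(100)
]

def pack(chunk, zdict=None):
    """Compress one chunk independently, optionally primed with a shared dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9)
    else:
        c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(chunk) + c.flush()

def unpack(blob, zdict=None):
    """Decompress one chunk; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj() if zdict is None else zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

plain = sum(len(pack(c)) for c in chunks)
shared = sum(len(pack(c, boilerplate)) for c in chunks)

print(f"without dictionary: {plain} bytes, with shared dictionary: {shared} bytes")
```

Each chunk can still be decompressed on its own, which is the point: the dictionary recovers some of the cross-chunk redundancy without giving up random access.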