At a first glance, the Wikimedia XML dump does not look substantially different from what Kiwix/ZIM does with compressed HTML: They're both compressed (bz2 for the Wikimedia dump, zstd or LZMA for Kiwix/ZIM), and both compress multiple files at once, so inter-file redundancy should hopefully be significantly reduced.
HTML seems a bit more verbose than the Mediawiki syntax (plus the XML header for each article), but I'd be surprised if that actually accounted for a 3x difference in size.
Then again, Kiwix seems to have experimented with shared dictionary brotli compression, which supposedly yields an >2x improvement: https://github.com/openzim/libzim/issues/144
I wonder if their current zstd implementation also uses shared dictionaries. If not, that might just be the reason: If ZIM compression chunks are much smaller than the bz2 streams of the Wikimedia dumps, there would still be a lot of redundancy between chunks.
The XML, which is larger, is now 21 GB. How did you get 60GB for the non-image English file?
https://dumps.wikimedia.org/enwiki/latest/
enwiki-latest-pages-articles.xml.bz2 02-Dec-2023 02:38 21557219519