by peterburkimsher
931 days ago
The 14GB is from 2021-03-01. It's not filtered, it just stores the wikitext and an index of the article names, compressed with BZ2. The XML, which is larger, is now 21 GB. How did you get 60GB for the non-image English file?
https://dumps.wikimedia.org/enwiki/latest/
enwiki-latest-pages-articles.xml.bz2 02-Dec-2023 02:38 21557219519
|
At first glance, the Wikimedia XML dump does not look substantially different from what Kiwix/ZIM does with compressed HTML: they're both compressed (bz2 for the Wikimedia dump, zstd or LZMA for Kiwix/ZIM), and both compress multiple files at once, so inter-file redundancy should hopefully be significantly reduced.
HTML seems a bit more verbose than the MediaWiki syntax (plus the XML header for each article), but I'd be surprised if that actually accounted for a 3x difference in size.
Then again, Kiwix seems to have experimented with shared dictionary brotli compression, which supposedly yields a >2x improvement: https://github.com/openzim/libzim/issues/144
I wonder if their current zstd implementation also uses shared dictionaries. If not, that might just be the reason: If ZIM compression chunks are much smaller than the bz2 streams of the Wikimedia dumps, there would still be a lot of redundancy between chunks.
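To illustrate the chunk-size point, here is a rough sketch using only Python's standard library. It is not how ZIM or the Wikimedia dumps actually work: the "articles" are synthetic, `make_article` and `deflate` are made-up helper names, and zlib's preset dictionary (`zdict`) merely stands in for the zstd/brotli shared dictionaries discussed above. It only shows the general effect: small independently compressed chunks lose the redundancy between chunks, and a shared dictionary recovers part of it.

```python
import bz2
import zlib


def make_article(i: int) -> bytes:
    """Build a tiny synthetic page; all pages share the same markup boilerplate."""
    return (
        f"<page><title>Article {i}</title><revision><text>"
        f"'''Article {i}''' is an example entry. [[Category:Examples]] "
        f"{{{{Infobox|name=Article {i}|type=example}}}}"
        f"</text></revision></page>\n"
    ).encode()


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress one chunk on its own, optionally primed with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


articles = [make_article(i) for i in range(1, 2001)]

# One large stream vs. one tiny stream per article (both bz2).
one_stream = len(bz2.compress(b"".join(articles)))
per_article = sum(len(bz2.compress(a)) for a in articles)

# Per-article deflate, without and with a dictionary built from a sample page
# that is not part of the test set.
shared_dict = make_article(0)
no_dict = sum(len(deflate(a)) for a in articles)
with_dict = sum(len(deflate(a, shared_dict)) for a in articles)

print(f"bz2, single stream:       {one_stream} bytes")
print(f"bz2, one stream per page: {per_article} bytes")
print(f"deflate per page:         {no_dict} bytes")
print(f"deflate per page + dict:  {with_dict} bytes")

# Decompression needs the same dictionary.
decomp = zlib.decompressobj(zdict=shared_dict)
restored = decomp.decompress(deflate(articles[0], shared_dict)) + decomp.flush()
```

Real pages are far less uniform than this toy data, so the gap will be smaller in practice, but the direction should be the same.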