Hacker News
by karteum 803 days ago
I guess it's without the pictures, then? Because if I compare with the ZIM file format (which is optimized for this use case) https://kiwix.org/en/what-is-the-size-of-wikipedia/ I read "As of October 2022, the Full English Wikipedia (ca. 6.5 million articles), with images will use up 91GB of storage space (German and French, the second-largest: 36 GB). (...) If you can do without the images (what we call the nopic version), then you are down to 46 GB."

Correct, there are no images in the data except for 68 PNGs. It's just HTML files.
How is it possible that a bunch of HTML files would add up to 200 GB? Is it because of some kind of overhead?

Would a database dump maybe be smaller?

Well, "a bunch" is an understatement. I bet they have a bit more than just a bunch! It does pass a sniff test, though, going by https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia:

>As of May 2015, the current version of the English Wikipedia article / template / redirect text was about 51 GB uncompressed in XML format.

The compressed data at that time was 11.5 GB. And that's data from 9 years ago, and for English Wikipedia alone.
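
To put rough numbers on that sniff test, here is a quick back-of-envelope sketch using only figures quoted in this thread. It assumes decimal gigabytes, and it assumes the 200 GB HTML dump covers roughly the same ~6.5 million articles that Kiwix cites; the thread confirms neither, so treat the per-article averages as ballpark only.

```python
# Back-of-envelope check using only figures quoted in this thread.
GB = 10**9  # assumption: decimal gigabytes, not GiB

# Assumption: the HTML dump has about as many articles as the
# Kiwix figure for English Wikipedia (October 2022).
N_ARTICLES = 6_500_000


def compression_ratio(uncompressed_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed data is."""
    return uncompressed_gb / compressed_gb


def avg_bytes_per_article(total_gb: float, n_articles: int) -> float:
    """Average size per article if the total were spread evenly."""
    return total_gb * GB / n_articles


# May 2015 XML text: 51 GB uncompressed, 11.5 GB compressed.
ratio = compression_ratio(51, 11.5)

# 200 GB of plain HTML files vs. the 46 GB "nopic" ZIM file.
per_article_html = avg_bytes_per_article(200, N_ARTICLES)
per_article_zim_nopic = avg_bytes_per_article(46, N_ARTICLES)

print(f"2015 XML compression ratio: {ratio:.1f}x")
print(f"200 GB HTML: ~{per_article_html / 1000:.0f} kB per article")
print(f"46 GB nopic ZIM: ~{per_article_zim_nopic / 1000:.0f} kB per article")
```

Under those assumptions, about 31 kB of uncompressed HTML per article versus about 7 kB per article in the nopic ZIM works out to roughly the same 4.4x gap that the 2015 XML dump showed between uncompressed and compressed text.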

For comparison, I collect leaked password dumps and they (combined, after deduplication) run into the hundreds of GB too. And that's just username:password lines, not even text.