by read_if_gay_ 2075 days ago
> all the content of English's wikipedia without images and videos in just 36GB

36GB seems like a really big number if it's just text. A cursory Google search says 1MB will hold about 500 pages of text (ignoring compression), so 36GB would be something like 18 million pages. Let's say a 1000-page book is 10cm wide: 18M pages wind up as 1800 meters of books, or 180 one-meter-wide bookcases with 10 shelves each, which is maybe a large library? It seems like a lot of that must be external sources. I wonder what percentage was actually written by Wikipedia editors?
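
The back-of-envelope estimate above can be checked with a few lines of Python. This is only a sketch: the inputs are the rough figures from the comment (about 500 pages per MB, a 1000-page book being 10cm wide), while 1GB = 1000MB and a one-meter shelf width are assumptions made here for the arithmetic.

```python
# Rough inputs taken from the comment; none of these are measured values.
TOTAL_GB = 36
PAGES_PER_MB = 500
PAGES_PER_BOOK = 1000
BOOK_WIDTH_CM = 10

# Assumptions added for this sketch.
MB_PER_GB = 1000
SHELF_WIDTH_M = 1
SHELVES_PER_BOOKCASE = 10

pages = TOTAL_GB * MB_PER_GB * PAGES_PER_MB
books = pages // PAGES_PER_BOOK
shelf_length_m = books * BOOK_WIDTH_CM / 100
bookcases = shelf_length_m / (SHELF_WIDTH_M * SHELVES_PER_BOOKCASE)

print(f"{pages:,} pages, {books:,} books, "
      f"{shelf_length_m:.0f} m of shelving, {bookcases:.0f} bookcases")
```

The estimate is only as good as the "500 pages per MB" figure, which assumes plain uncompressed text.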

4 comments

Not sure what you mean by external sources, but I have seen nothing but user-generated content in there (though I haven't read every Wikipedia article, obviously).

A few things to note, though:

1/ It's not pure text content, it's HTML content, which carries significant markup overhead.

2/ A zim file is not just compressed content, it also contains large indexes recording where each piece of content is stored. You look up your article's title in the reference table, find the position of your article in the file, and decompress just that part. This is what allows selective decompression without decompressing the whole content.
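
The lookup-then-decompress idea in point 2/ can be sketched roughly as follows. This is a toy illustration, not the actual zim layout: the function names `build_archive` and `read_article`, the index tuple, and the use of zlib are all assumptions made for the example. Articles are grouped into small clusters that are compressed independently, and an index records where each article lives.

```python
import zlib


def build_archive(articles, cluster_size=2):
    """Pack {title: html} into (blob, index) using independently compressed clusters.

    Hypothetical layout: index[title] = (cluster_offset, cluster_length,
    offset_inside_cluster, article_length).
    """
    blob = bytearray()
    index = {}
    items = sorted(articles.items())
    for start in range(0, len(items), cluster_size):
        raw = bytearray()
        spans = []
        for title, html in items[start:start + cluster_size]:
            data = html.encode("utf-8")
            spans.append((title, len(raw), len(data)))
            raw += data
        compressed = zlib.compress(bytes(raw))
        cluster_offset = len(blob)
        blob += compressed
        for title, inner_offset, length in spans:
            index[title] = (cluster_offset, len(compressed), inner_offset, length)
    return bytes(blob), index


def read_article(blob, index, title):
    """Decompress only the cluster that holds the requested article."""
    cluster_offset, cluster_length, inner_offset, length = index[title]
    raw = zlib.decompress(blob[cluster_offset:cluster_offset + cluster_length])
    return raw[inner_offset:inner_offset + length].decode("utf-8")


articles = {
    "Alpha": "<html><body>Alpha article</body></html>",
    "Beta": "<html><body>Beta article</body></html>",
    "Gamma": "<html><body>Gamma article</body></html>",
}
blob, index = build_archive(articles)
print(read_article(blob, index, "Gamma"))
```

Reading "Gamma" here touches only the second cluster; the first one is never decompressed. The index itself takes space, which is part of the overhead mentioned above.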

The zim file format is far from ideal for compression efficiency: the best-compressing algorithms typically don't allow random access without decompressing everything.
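
A quick way to see the trade-off is to compress the same pages once as a single stream and once page by page. This is a minimal sketch with made-up, highly repetitive sample pages and zlib standing in for whatever compressor a real archive would use, so the size of the gap is not representative of real Wikipedia content.

```python
import zlib

# Made-up sample pages that share most of their markup.
pages = [
    f"<html><body><h1>Article {i}</h1>"
    f"<p>This is the shared boilerplate text of a page.</p></body></html>"
    for i in range(200)
]

# One stream: best ratio, but reading any page means decompressing from the start.
whole_size = len(zlib.compress("".join(pages).encode("utf-8")))

# One stream per page: random access is trivial, but redundancy
# between pages can no longer be exploited.
per_page_size = sum(len(zlib.compress(p.encode("utf-8"))) for p in pages)

print(f"single stream: {whole_size} bytes, per page: {per_page_size} bytes")
```

Grouping several articles into a cluster sits between these two extremes.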

Also, Wikipedia has a lot of spam and orphan pages, insanely long lists, etc. Those are hard to filter out algorithmically.

Wikipedia (English) currently has about 6.2 million pages: https://en.wikipedia.org/wiki/Special:Statistics
I'd assume that figure (the 36GB) would also include the indexes required for searching.