Hacker News
by JeanFred 2075 days ago
Check out Kiwix: https://www.kiwix.org/

“Kiwix is an offline reader for online content like Wikipedia, Project Gutenberg, or TED Talks. It makes knowledge available to people with no or limited internet access. The software as well as the content is free to use for anyone.”

2 comments

Indeed, Kiwix is the proper solution for anyone who cares about knowledge access in places where the internet connection is spotty. Because the file format it uses (ZIM) is compressed, and is meant to be read while still compressed so that a single article can be plucked out, you can have all the content of English Wikipedia without images and videos in just 36GB. Plus, the specs of the file format are published and it's easy to build your own implementation; I did it for my needs.

It's also great for privacy maximalists: I have Wikipedia, Wikisource and Wiktionary in two languages locally, which means that most of my searches never leave my computer.

Kiwix/openZIM don't only provide MediaWiki projects, either: I've recently downloaded Stack Overflow's content (although I'll need to build a dedicated search engine for it to be really usable).

You can have a look at the incredible amount of available content there: https://wiki.kiwix.org/wiki/Content_in_all_languages

> all the content of English's wikipedia without images and videos in just 36GB

36GB seems like a really big number if it's just text. A cursory Google search says 1MB will hold about 500 pages of text (ignoring compression). So 36GB would be something like 18 million pages? Let's say a 1000-page book is 10cm wide, so 18M pages wind up as 1800 meters of books, or 180 bookcases, each a meter wide with 10 shelves, which is maybe a large library? It seems like a lot of that must be external sources. I wonder what percentage was actually written by Wikipedia editors?
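
The estimate above is easy to re-run with different guesses. Here is a minimal sketch in Python that uses only the figures from this comment; the variable names are mine, and treating 1GB as 1000MB is an assumption:

```python
# Back-of-envelope estimate; every input is a rough guess taken from the comment.
size_gb = 36
pages_per_mb = 500           # "1MB will hold about 500 pages of text"
pages_per_book = 1000
book_width_m = 0.10          # a 1000-page book is taken to be 10cm wide
shelf_width_m = 1.0
shelves_per_bookcase = 10

pages = size_gb * 1000 * pages_per_mb             # assumes 1GB = 1000MB
books = pages / pages_per_book
shelf_length_m = books * book_width_m
bookcases = shelf_length_m / (shelf_width_m * shelves_per_bookcase)

print(f"{pages:,} pages, {books:,.0f} books")
print(f"{shelf_length_m:,.0f} m of shelf, {bookcases:,.0f} bookcases")
```

Changing `pages_per_mb` is the quickest way to see how sensitive the result is to the "500 pages per MB" guess.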

Not sure what you mean by external sources, but I have seen nothing but user-generated content in there (but I haven't read all Wikipedia articles, obviously).

A few things to note, though:

1/ It's not pure text content, it's HTML content, which has a significant overhead.

2/ A ZIM file is not just compressed content, but also large indexes that record where each piece of content sits. You look up your article's title in the reference table, find the position of your article in the file, and decompress just that part. This is what allows selective decompression without decompressing the whole file.
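
To make point 2 concrete, here is a toy sketch of the idea in Python. It is not the real ZIM layout: the cluster size, the NUL separator, the function names and the sample articles are all invented for illustration, and `zlib` from the standard library is only a stand-in compressor.

```python
import bisect
import zlib

def build_archive(articles, per_cluster=2):
    """Pack articles into independently compressed clusters plus a sorted title index."""
    titles = sorted(articles)
    blob = bytearray()
    index = []  # sorted entries: (title, cluster_offset, cluster_length, position_in_cluster)
    for start in range(0, len(titles), per_cluster):
        group = titles[start:start + per_cluster]
        # Articles in one cluster are joined with a NUL byte and compressed together
        # (so this toy format cannot store text that itself contains NUL).
        raw = b"\x00".join(articles[t].encode("utf-8") for t in group)
        packed = zlib.compress(raw)
        for pos, title in enumerate(group):
            index.append((title, len(blob), len(packed), pos))
        blob += packed
    return bytes(blob), index

def read_article(blob, index, title):
    """Binary-search the index, then decompress only the cluster holding the article."""
    i = bisect.bisect_left(index, (title,))
    if i == len(index) or index[i][0] != title:
        raise KeyError(title)
    _, offset, length, pos = index[i]
    raw = zlib.decompress(blob[offset:offset + length])
    return raw.split(b"\x00")[pos].decode("utf-8")

# Made-up sample content, just to exercise the functions.
articles = {
    "Kiwix": "<p>placeholder text one</p>",
    "Wikisource": "<p>placeholder text two</p>",
    "Wiktionary": "<p>placeholder text three</p>",
}
blob, index = build_archive(articles)
print(read_article(blob, index, "Wiktionary"))
```

Only the bytes of one cluster are ever passed to `zlib.decompress`, which is the "decompress just that part" step; the price is that the index has to be stored alongside the compressed data.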

The ZIM file format is far from ideal for compression efficiency - the best algorithms typically don't allow random access without decompressing everything.
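
A rough way to see that trade-off is to compress the same data once as a single stream and once as independent per-article streams. The sketch below uses synthetic text and `zlib` purely as a stand-in (it makes no claim about which compressor ZIM uses), so the exact numbers mean little; only the direction of the gap matters.

```python
import random
import string
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
# A shared vocabulary of 200 random 8-letter "words".
vocab = ["".join(rng.choice(string.ascii_lowercase) for _ in range(8)) for _ in range(200)]
# 300 synthetic "articles" of 60 words each, all drawn from that vocabulary.
articles = [" ".join(rng.choice(vocab) for _ in range(60)).encode() for _ in range(300)]

# One stream over everything: redundancy across articles is exploited,
# but reading the last article means decompressing from the start.
whole = len(zlib.compress(b"\n".join(articles), 9))

# One stream per article: any article can be read on its own,
# but each stream starts with no knowledge of the others.
chunks = [zlib.compress(a, 9) for a in articles]
separate = sum(len(c) for c in chunks)

print(f"single stream: {whole} bytes, per-article streams: {separate} bytes")

# Random access works on the chunked version without touching other chunks.
assert zlib.decompress(chunks[299]) == articles[299]
```

Grouping several articles per chunk sits between the two extremes: larger chunks compress better but cost more to decompress for a single lookup.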

Also, Wikipedia has a lot of spam and orphan pages, insanely long lists, etc. Those are hard to filter out algorithmically.

English Wikipedia currently has about 6.2 million pages: https://en.wikipedia.org/wiki/Special:Statistics
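
Setting that page count against the 36GB quoted earlier in the thread gives a rough per-page size. A quick sketch, assuming decimal gigabytes and assuming the 6.2 million figure matches what is actually in the file:

```python
size_bytes = 36 * 10**9      # 36GB, taken as decimal gigabytes (assumption)
pages = 6_200_000            # "about 6.2 million pages"
avg_bytes = size_bytes / pages
print(f"about {avg_bytes / 1000:.1f} kB per page, compressed")
```

That average covers the compressed HTML plus whatever else is stored in the file, so it is an upper bound on the article text itself.
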
I'd assume the 36GB figure also includes the indexes required for searching.

I used this in Cuba. It was immensely useful, both to pass the time and to look up many things of interest along the way without waiting to go to an internet zone.

I was a passenger in a car driving through central Cuba and thought I saw a sign towards Australia. Breaking out Kiwix, I found the article and was relieved to see I wasn't going crazy.

Later on I was in an Uber in Australia driven by a Cuban man, and I thought I'd impress my friends with my worldliness by mentioning that there's a town in Cuba called Australia. The driver furrowed his brows and said flatly "no there's not", much to their delight. Can't win 'em all!

https://en.m.wikipedia.org/wiki/Australia,_Cuba