by omneity 337 days ago
Incidentally, I just posted about Wikipedia Monthly[0], a monthly dump of Wikipedia broken down by language, with the MediaWiki markup cleaned into plain text, so it's perfect for a local search index or similar scenarios.

There are 341 languages in there and 205GB of data, with English alone making up 24GB! As for Simple English Wikipedia (from the OP), it's decent, but the content tends to be shallow and imprecise.

0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...