And I've changed it a little bit to extract only the first n characters. This might be of some use, since Wikipedia dumps are supposed to be pretty large: https://github.com/mooss/ruskea/blob/master/make_wiki_corpus...
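For anyone who just wants the idea without opening the script, here is a rough sketch of the "first n characters" cutoff. It is not the code from the linked file: `take_first_n_chars` and the `articles` list are hypothetical names made up for illustration, and the list stands in for whatever stream of article texts the dump reader produces.

```python
from typing import Iterable, Iterator


def take_first_n_chars(texts: Iterable[str], limit: int) -> Iterator[str]:
    """Yield pieces of `texts` until `limit` characters in total were produced.

    Hypothetical helper: it stops pulling from `texts` as soon as the limit
    is reached, so the rest of a large dump is never read.
    """
    if limit <= 0:
        return
    remaining = limit
    for text in texts:
        piece = text[:remaining]  # truncate the text that crosses the limit
        remaining -= len(piece)
        yield piece
        if remaining <= 0:
            return  # limit reached: do not touch the remaining texts


# Stand-in data (assumption): in practice this would be a lazy stream of articles.
articles = ["first article", "second article", "third article"]
print(list(take_first_n_chars(articles, 20)))
# → ['first article', 'second ']
```

Because it is a generator, the pieces can be written to the output file one at a time, and memory use stays flat regardless of how large the dump is.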