by mooss 2650 days ago
I've had some success using this tutorial: https://www.kdnuggets.com/2017/11/building-wikipedia-text-co... .

I've also changed it a little to extract only the first n characters, which might be of some use since Wikipedia dumps are pretty large: https://github.com/mooss/ruskea/blob/master/make_wiki_corpus... .
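
For anyone who just wants the idea without opening the script, here is a rough sketch of the "first n characters" part. It is not necessarily what the linked script does; the function name `write_first_n_chars` and its parameters are made up for illustration. It assumes the articles arrive as lists of tokens, one list per article, and it stops pulling from the iterator as soon as the limit is hit, so the rest of the dump never gets parsed.

```python
import io
from typing import Iterable, List, TextIO


def write_first_n_chars(articles: Iterable[List[str]], out: TextIO, max_chars: int) -> int:
    """Write space-joined articles, one per line, until max_chars characters are written.

    Hypothetical helper for illustration. Returns the number of characters written.
    """
    written = 0
    if max_chars <= 0:
        return written
    for tokens in articles:
        # Cut the line so the running total never exceeds max_chars.
        line = (" ".join(tokens) + "\n")[: max_chars - written]
        out.write(line)
        written += len(line)
        if written >= max_chars:
            # Stop here so the remaining articles are never read.
            break
    return written


# Small demo with made-up token lists standing in for parsed articles.
demo_articles = [["anarchism", "is"], ["autism", "is", "a"]]
demo_out = io.StringIO()
demo_written = write_first_n_chars(iter(demo_articles), demo_out, 16)
```

If the tutorial's approach is followed, the token lists would presumably come from something like gensim's `WikiCorpus(dump_path, dictionary={}).get_texts()`, and `out` would be a file opened for writing in text mode. That part is my assumption and is not shown above.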