Hacker News
by thomas536 2655 days ago
I also didn't find much information about how long an import into a database would take, so I used the XML dumps directly [1]. I only needed the wiki content (not the history), so the article XML files worked well for me. I then used mwparserfromhell [2] to parse the wiki markup and extract what I needed.

[1] https://dumps.wikimedia.org/enwiki/20190301/

[2] https://mwparserfromhell.readthedocs.io/en/latest/
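
For anyone wanting to try the same route, here is a minimal sketch of reading pages straight from a dump and stripping the markup. It is not the exact code I used: the names `iter_pages`, `to_plain_text`, `local_name`, and `SAMPLE` are made up for illustration, and the tiny inline document only stands in for a real dump file. The streaming part uses the standard library; only the last function needs mwparserfromhell.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that the export XML puts on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    # iterparse reads incrementally, so the whole dump is never loaded at once.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the finished page's subtree


def to_plain_text(wikitext):
    """Strip templates, links, and formatting using mwparserfromhell."""
    # Imported here so the streaming code above works without the package.
    import mwparserfromhell  # third-party: pip install mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code()


# Hypothetical stand-in for a real dump, just to show the expected shape.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <ns>0</ns>
    <revision><text>'''Example''' is a [[test]] page.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

For a real dump you would pass an open file instead of the `BytesIO` object, for example one returned by `bz2.open(path, "rb")` if the file is bz2-compressed, and call `to_plain_text` on each page's wikitext as it comes out of the generator.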

1 comment

I've been working on research for a recommender system using XML wiki article dumps for the last few months. I've also been using mwparserfromhell to get plain text and some other metadata I needed from articles to build a dataset. It seems to work pretty well for that use case, anyway.
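
To give a rough idea of what pulling plain text and metadata out of an article can look like, here is a small sketch. The function name `article_metadata` and the choice of fields (links, categories, templates) are assumptions for the example, not a description of my actual dataset; the mwparserfromhell calls themselves (`parse`, `strip_code`, `filter_wikilinks`, `filter_templates`) are part of the library.

```python
def article_metadata(wikitext):
    """Return plain text plus link, category, and template names for one article.

    The set of fields is illustrative; pick whatever your dataset needs.
    """
    # Imported inside the function so this module loads even without the package.
    import mwparserfromhell  # third-party: pip install mwparserfromhell

    code = mwparserfromhell.parse(wikitext)
    # Each wikilink node exposes the link target as .title
    targets = [str(link.title).strip() for link in code.filter_wikilinks()]
    return {
        "text": code.strip_code().strip(),
        "links": [t for t in targets if not t.startswith("Category:")],
        "categories": [t.split(":", 1)[1] for t in targets if t.startswith("Category:")],
        "templates": [str(tpl.name).strip() for tpl in code.filter_templates()],
    }
```

Calling it on a string such as `"'''Foo''' is a [[bar|thing]]. [[Category:Baz]]"` gives a dict you can write out as one row of the dataset. Treating `Category:` links as categories is a simplification that fits English Wikipedia markup.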