Hacker News
by yreg 416 days ago
Regarding mwoffliner: Why scrape Wikipedia when you can just download a dump?
2 comments

If you want to test MediaWiki tooling, Wikipedia is a good test target, because it uses a lot of the features (unsurprisingly), compared to smaller wikis. (OTOH, the latter often have custom extensions, so testing against Wikipedia alone is not quite enough.)
Sure, but I understood the parent as saying that the tool primarily serves for scraping Wikipedia.
I was thinking the same. It must take much less space in database form than all the HTML pages.
It's also kind of bad form to scrape a huge website when there's a downloadable dump available. Save yourself, and more importantly Wikimedia, a whole lot of bandwidth & CPU cycles.
And torrenting the dumps helps distribute them to others as well.