HN Mirror


by gojomo 4695 days ago

Yes and no.

Some crawlers are most interested in the freshest versions of the most heavily inlinked articles, or in the exact HTML presentation that Wikipedia serves.

The monthly full raw wikitext dumps don't provide that.

Also, Wikipedia's serving plant is pretty efficient, and bandwidth is only a small portion of their costs. They can afford some crawling, and correspondingly their /robots.txt is pretty open.

Good crawlers seeking just the bulk text shouldn't try to grab the whole thing as fast as possible via the standard web URLs... but other good crawlers may want or need to visit discovered Wikipedia links, and doing so at a measured pace should be OK.

1 comment

greglindahl 4695 days ago

blekko tried crawling a local copy, and it was a PITA. We'd rather crawl the real thing with a crawl-delay of 1 (one second between requests). Best would be if the Wikimedia Foundation made an HTML dump available.
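
For anyone wondering what "a measured pace" with a crawl-delay might look like in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is not blekko's code or anything Wikipedia prescribes. The `PoliteScheduler` class, the `examplebot` user agent, and the `EXAMPLE_ROBOTS` text are all made up for illustration; in particular, the robots.txt body is hypothetical and is not a copy of Wikipedia's real file.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt body (NOT Wikipedia's actual file), inlined so the
# sketch runs without network access.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 1
Disallow: /w/
"""


class PoliteScheduler:
    """Decides whether, and how soon, a URL on a single host may be fetched."""

    def __init__(self, robots_txt, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.user_agent = user_agent
        # Honor the site's Crawl-delay if it has one, else use our own default.
        delay = self.parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self.clock = clock
        self.sleep = sleep
        self.last_fetch = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until the delay since the previous fetch has elapsed.

        Returns the number of seconds actually waited.
        """
        waited = 0.0
        if self.last_fetch is not None:
            elapsed = self.clock() - self.last_fetch
            waited = max(0.0, self.delay - elapsed)
            if waited > 0:
                self.sleep(waited)
        self.last_fetch = self.clock()
        return waited
```

A crawl loop would call `allowed(url)` before each request and `wait_turn()` right before fetching. A real crawler would also download the live robots.txt (for example with the parser's `set_url` and `read` methods) and keep one scheduler per host; both are left out here to keep the sketch short.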
