by wicknicks 4695 days ago
Good crawlers should typically avoid Wikipedia links, to reduce the number of HTTP requests hitting the wiki servers (and keep Wikipedia's costs down), esp. because Wikipedia makes whole data dumps available for download through a separate, cheaper channel: http://en.wikipedia.org/wiki/Wikipedia:Database_download

Yes and no.

Some crawlers are most interested in the freshest versions of the most inlinked articles, or in the exact HTML presentation at Wikipedia.

The monthly full raw wikitext dumps don't provide that.

Also, Wikipedia's serving plant is pretty efficient, with bandwidth being only a small portion of their costs. They can afford some crawling... and correspondingly, their /robots.txt is pretty open.

Good crawlers seeking just the bulk text shouldn't try to grab the whole thing as fast as possible via the standard web URLs... but other good crawlers may want or need to visit discovered Wikipedia links, and doing so at a measured pace should be OK.
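
To make "a measured pace" concrete, here is a minimal sketch of a per-host throttle in Python. The class name `HostThrottle`, the 1-second default interval, and the injectable `clock`/`sleep` arguments are all assumptions for illustration, not anything Wikipedia or a particular crawler prescribes.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical helper: the name and the 1-second default are assumptions.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds actually waited.
        """
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Schedule the next slot relative to when this request starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

A crawler would call `throttle.wait(url)` right before each fetch. Requests to different hosts don't delay each other, so only the Wikipedia-bound part of the crawl slows down.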

blekko attempted to implement crawling a local copy, and it was a PITA. We'd rather crawl the real thing with a crawl-delay of 1. Best would be if the Wikimedia Foundation made an HTML dump available.
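
For anyone wondering how a crawler would pick up a crawl-delay like that, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt text below is made up for illustration (it is not Wikipedia's actual file), and `crawl_policy`, `examplebot`, and the fallback delay are hypothetical names and values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
Disallow: /private/
"""


def crawl_policy(robots_txt, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for `url` under the given robots.txt text.

    `default_delay` is an assumed fallback used when no Crawl-delay is declared.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    declared = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    delay = float(declared) if declared is not None else default_delay
    return allowed, delay


allowed, delay = crawl_policy(
    SAMPLE_ROBOTS_TXT, "examplebot", "http://example.org/wiki/Foo"
)
```

In a real crawler the robots.txt text would be fetched from the site (and cached) rather than hard-coded, and the returned delay would feed whatever rate limiting the crawler already does.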