by danso 4695 days ago
The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZN...

Table 2a purports to show the frequency of SLDs:

   1  youtube.com            95,866,041  0.0250
   2  blogspot.com           45,738,134  0.0119
   3  tumblr.com             30,135,714  0.0079
   4  flickr.com              9,942,237  0.0026
   5  amazon.com              6,470,283  0.0017
   6  google.com              2,782,762  0.0007
   7  thefreedictionary.com   2,183,753  0.0006
   8  tripod.com              1,874,452  0.0005
   9  hotels.com              1,733,778  0.0005
  10  flightaware.com         1,280,875  0.0003

If I'm reading this correctly, the crawler hit a huge number of YouTube video pages, but still only a fraction of the total. I couldn't find a total YouTube video count, but YouTube's own stats page says that 200 million videos alone have been tagged with Content-ID (identified as belonging to movie/TV studios).
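As a rough sanity check on the table, the counts and the last column can be combined to estimate the size of the whole crawl. This is only a sketch: it assumes the last column is each domain's share of all crawled pages, rounded to four decimals, which the quoted rows do not state explicitly. The function name `implied_total` is made up for illustration.

```python
# Rows copied from Table 2a as quoted above: (domain, page count, last column).
# ASSUMPTION: the last column is the domain's share of all crawled pages,
# rounded to four decimal places.
ROWS = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
]

def implied_total(count, share, half_step=0.00005):
    """Return (low, high) bounds on the crawl size implied by a rounded share."""
    return count / (share + half_step), count / (share - half_step)

for domain, count, share in ROWS:
    low, high = implied_total(count, share)
    print(f"{domain}: between {low / 1e9:.2f}B and {high / 1e9:.2f}B pages")
```

If the assumption holds, the three largest rows should give overlapping ranges, which would point to a single crawl total rather than per-table denominators.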

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.


Good crawlers should typically avoid Wikipedia links, to reduce the number of HTTP requests hitting Wikipedia's servers (and keep its costs down), especially because Wikipedia makes whole data dumps available for download through a separate, cheaper channel: http://en.wikipedia.org/wiki/Wikipedia:Database_download
Yes and no.

Some crawlers are most interested in the freshest versions of the most inlinked articles, or in the exact HTML presentation at Wikipedia.

The monthly full raw wikitext dumps don't provide that.

And, Wikipedia's serving plant is pretty efficient, with bandwidth only being a small portion of their costs. They can afford some crawling... and correspondingly, their /robots.txt is pretty open.

Good crawlers seeking just the bulk text shouldn't try to grab the whole thing as fast as possible via the standard web URLs... but other good crawlers may want or need to visit discovered Wikipedia links, and doing so at a measured pace should be OK.

blekko attempted to implement crawling a local copy, and it was a PITA. We'd rather crawl the real thing with a crawl-delay of 1. Best would be if the Wikimedia Foundation made a .html dump available.
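For readers unfamiliar with crawl-delay, here is a minimal Python sketch of how a crawler might honor robots.txt rules and a per-request delay, using the standard library's `urllib.robotparser`. The robots.txt content, the `examplebot` user agent, the example.org URLs, and the helper names are all made up for illustration; this is not blekko's implementation and not Wikipedia's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# HYPOTHETICAL robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 1
Disallow: /private/
"""

def make_parser(robots_txt):
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def plan_fetches(parser, user_agent, urls, default_delay=1.0):
    """Return (url, wait_seconds) pairs for the URLs the rules allow."""
    delay = parser.crawl_delay(user_agent)
    if delay is None:  # no Crawl-delay given: fall back to a cautious default
        delay = default_delay
    return [(url, delay) for url in urls if parser.can_fetch(user_agent, url)]

parser = make_parser(SAMPLE_ROBOTS)
plan = plan_fetches(parser, "examplebot", [
    "http://example.org/wiki/Web_crawler",
    "http://example.org/private/page",
])
for url, wait in plan:
    print(f"fetch {url}, then sleep {wait}s")
```

A real crawler would call `time.sleep(wait)` between requests to the same host and re-read robots.txt periodically; both are left out to keep the sketch short.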
There are at least 2.5M English Wikipedia pages indexed in the crawl:

  $ cci_lookup org.wikipedia.en | wc -l
  2516956
(See https://github.com/wiseman/common_crawl_index, but note that the index is incomplete.)
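The lookup key above, `org.wikipedia.en`, looks like the hostname `en.wikipedia.org` with its labels reversed. Assuming that is the convention (the linked repository is the authority on the exact key format), a small Python helper can build such a key from a URL. The name `reversed_host_key` is made up for this sketch.

```python
from urllib.parse import urlsplit

def reversed_host_key(url):
    """Turn a URL's hostname into a reversed-label key, e.g. for prefix lookups.

    ASSUMPTION: the index keys on the reversed hostname, as the
    `org.wikipedia.en` example above suggests.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(reversed(host.split(".")))

print(reversed_host_key("http://en.wikipedia.org/wiki/Wikipedia:Database_download"))
```

Reversing the labels groups all subdomains of a site under one prefix, so a single prefix query such as `org.wikipedia` would cover every language edition.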