A Look Inside Our 210TB 2012 Web Corpus

	A Look Inside Our 210TB 2012 Web Corpus (commoncrawl.org)
	102 points by LisaG 4695 days ago

7 comments

mark_l_watson 4695 days ago

Check out the Common Crawl contest winning projects from the linked page - some very good work, and a good source of ideas and techniques: http://commoncrawl.org/the-winners-of-the-norvig-web-data-sc...

Some good stuff!

wicknicks 4695 days ago

I loved the inter-lingual web page linkage visualization project. Any idea why Traitor won the contest? It seems very similar to the regular "create inverted index with map reduce" problem, or am I missing something?

mark_l_watson 4694 days ago

Perhaps Traitor won because it is such a good example of using Map Reduce over the Common Crawl data? I agree that inter-lingual was a cool project.

Aloisius 4695 days ago

Link to the PDF mentioned: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZN...

sylvinus 4695 days ago

Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking...

boyter 4695 days ago

Very, and probably not very good (compare Gigablast to Google as an example of why it's hard). Not to take anything away from Common Crawl, but crawling is often one of the easier things to build when creating a search engine. A crawler can be as simple as

for(listofurls) { geturl; add urls to listofurls; }

Doing it on a large scale over and over is a harder problem (which Common Crawl does for you), but it's not too difficult until you hit scale or want realtime crawling.
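
For anyone who wants to try the loop above, here is a minimal Python sketch of it as a breadth-first crawl. Everything named here (`crawl`, `LinkExtractor`, the `fetch` callback, `max_pages`, the `example.test` pages) is an assumption made for illustration, not anything Common Crawl uses. Fetching is passed in as a function so the loop can be exercised against an in-memory dict instead of the live web.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque(seeds)
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical three-page site held in memory instead of fetched over HTTP.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">a</a> <a href="/b#top">b</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/b">b</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what keeps the loop from revisiting pages; a real crawler would also need politeness delays, robots.txt handling and persistent storage, none of which is shown here.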

Building an index on 210 TB of data however... Assuming you use Sphinx/Solr/Gigablast you are going to need about 50 machines to deal with this amount of data with any sort of redundancy. That's just to hold a basic index, which does not include "pagerank" or anything (Gigablast is a web engine so it might have that in there, not sure). You aren't factoring in adding rankers to make it a web search engine, spam/porn detection and all of the other stuff that goes with it. Then you get into serving results. Unless your indexes are in RAM you are going to have a pretty slow search engine. So add a lot more machines to hold the index for common terms in memory.

If someone is keen to do this, however, here is a list of articles/blogs which should get you started (I wrote this originally as a HN comment which got a lot of attention, so I made it into a blog post): http://www.boyter.org/2013/01/want-to-write-a-search-engine-...

asgard1024 4695 days ago

Actually, not so simple. Sure, you can do simple crawling easily; but the hard part is to extract meaningful data from it. It's very easy to loop on many sites for instance. Protocol violations abound - some sites serve binaries as text/html, for instance.

What I heard about a smaller search engine was that web crawling is usually augmented with some manually added rules for various sites to prevent spoiling the database. Not a trivial task at all.

Doing queries is IMHO algorithmically much better understood, because it's a constrained problem. But getting information extracted out from the real world, with all the PHP and HTML "hackers", not so easy.
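
Two of the problems mentioned above (looping on a site, and binaries served as text/html) can be guarded against with crude heuristics. The Python sketch below is only an illustration under stated assumptions: `looks_binary` and `has_repeating_path` are hypothetical helper names, the thresholds are arbitrary, and this is not a description of what any particular search engine does.

```python
from urllib.parse import urlsplit


def looks_binary(body: bytes, sample_size: int = 1024) -> bool:
    """Crude sniff: a NUL byte near the start suggests a binary payload.

    This ignores the Content-Type header on purpose, since the header may lie.
    It will misclassify UTF-16 text, so treat it as a first filter only.
    """
    return b"\x00" in body[:sample_size]


def has_repeating_path(url: str, max_repeats: int = 2) -> bool:
    """Flag URLs such as /cal/next/cal/next/cal/next that often mark a crawler trap."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))


print(looks_binary(b"\x7fELF\x02\x01\x01\x00"))
print(looks_binary(b"<html><body>hi</body></html>"))
print(has_repeating_path("http://example.test/cal/next/cal/next/cal/next"))
print(has_repeating_path("http://example.test/a/b/c"))
```

Heuristics like these only catch the obvious cases, which is consistent with the point that per-site manual rules end up being needed.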

Sven7 4695 days ago

Which is one of the main reasons Google has no serious competition in search except possibly in China.

It is also why innovation in search isn't moving as fast as it could be.

If Google opened up (unlimited) web API access to their search interface to, say, a large city for a year or two, people would really get a taste of what innovation in search looks like.

And of course it would be in Google's interest, because search as a platform or marketplace is where the future of Google really lies. All the other advertising-empire-defending distractions like Android, Chrome and YouTube are really sideshows.

boyter 4695 days ago

Personally I consider extracting meaning from the crawl part of the indexing step. That just comes down to how you define it though. In reality it's blurry, as you need to do some pre-indexing during the crawl to extract meaningful data, and as you say there are a lot of edge cases.

For basic crawling it really is as simple as "while links: download link" though.

benhamner 4695 days ago

The massive advantages that Google has include over a decade of data on the pages that people actually visited in response to a specific query as well as having an in-memory index of the public web, parts of which are updated on the order of seconds to minutes.

I wonder if there is a viable business in maintaining an in-memory & up-to-date index of the public web & selling access to it, with a pricing model that scales according to the amount of computation you are doing on it.

greglindahl 4695 days ago

blekko has data customers paying us for that kind of thing.

greglindahl 4695 days ago

It would be challenging. You've got a crawl, but one with a fair bit of spam in it, despite the donation of blekko metadata. Then you have to figure out ranking for keywords, something that the blekko metadata won't help you with at all.

jamesaguilar 4695 days ago

Not very, or else someone would have replicated Google already (with fewer engineers and less money than Microsoft has thrown at the problem).

sytelus 4695 days ago

This is actually a very small corpus, just 3 billion docs. Google, for example, is known to have 50 billion docs in its index.

rgrieselhuber 4695 days ago

Is there something, other than funding, preventing a more regular, open-sourced crawl of the web?

LisaG 4695 days ago

Limited resources are the only reason. We are working on a subset crawl of ~3 million pages that will be published weekly starting two weeks from now. But doing the full crawl takes a lot of time, effort and money.

boyter 4695 days ago

Is that really worth it though? I can crawl 3 million pages in less than 24 hours without any real effort on my part. Or are you going to provide 3 million of the most useful pages? Depth or breadth first crawl?
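
To put "3 million pages in less than 24 hours" in perspective, the sustained fetch rate it implies is simple arithmetic. The Python sketch below just does that division; `required_rate` is a name made up for illustration, and it assumes an even rate over the whole day with no retries or failures.

```python
PAGES = 3_000_000                # size of the subset crawl discussed in the thread
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def required_rate(pages: int, seconds: int) -> float:
    """Average pages per second needed to fetch `pages` within `seconds`."""
    return pages / seconds


rate = required_rate(PAGES, SECONDS_PER_DAY)
print(f"{rate:.1f} pages/second sustained")
```

That works out to roughly 35 pages per second, which supports the claim that the fetching itself is not the hard part at this size.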

LisaG 4695 days ago

We do think it is worth it to avoid duplicative efforts.

Suppose you crawl 3 million pages and you pay for the compute and storage costs. Then the next person who wants crawl data goes through the same effort and pays the same costs. Doesn't it make much more sense to have a common pool of open data that everyone can use? Even if the effort and costs are low, they are not zero.

For the smaller frequent crawl, we are working with Mozilla and we will do the top pages (top according to Alexa).

boyter 4695 days ago

Fair point and makes sense. If you publish the rank along with the data itself that would be very useful. Perhaps having a few sets of data? 3 million top pages, 3 million deep pages etc...

Personally I would like to see around 20-100 million pages or whatever is about 500-1000GB. That's enough data to work with on a local machine and serve up some meaningful results assuming you want to build a search engine or just do some deep analysis of the web.

dsinha 4693 days ago

Isn't there also the additional factor that webservers sometimes allow only the major search engines to crawl? If so, with something like this, should it gain popularity, and as more apps start using it, you'd hope more webservers allow the common crawler to crawl their websites which they might not if everyone were doing it individually...thinking aloud...

frederi 4695 days ago

Just because you can do it without much effort doesn't mean less experienced people can. Crawling can be a barrier to some people.

boyter 4695 days ago

To be honest a simple crawler is a very simple thing to write. If someone had issues getting that going I think they are going to have issues with the data volume anyway. LisaG answered why the 3 million data set though and I agree with the reasoning.

toomuchtodo 4695 days ago

Could you partner with other orgs that have the same needs? Like the Internet Archive?

LisaG 4695 days ago

Internet Archive (currently) doesn't want to put their data on any cloud service. We believe it is crucial that people can easily access and analyze the data so we put it on various cloud platforms. We are talking with a few organizations about getting data donations that we could put in our corpus and make available to everyone, but nothing is settled enough that I can publicly comment on those potential partnerships yet.

danso 4695 days ago

The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me somewhat confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZN...

Table 2a purports to show the frequency of SLDs:

   1  youtube.com            95,866,041  0.0250
   2  blogspot.com           45,738,134  0.0119
   3  tumblr.com             30,135,714  0.0079
   4  flickr.com              9,942,237  0.0026
   5  amazon.com              6,470,283  0.0017
   6  google.com              2,782,762  0.0007
   7  thefreedictionary.com   2,183,753  0.0006
   8  tripod.com              1,874,452  0.0005
   9  hotels.com              1,733,778  0.0005
  10  flightaware.com         1,280,875  0.0003

If I'm reading this correctly, it seems that the crawler managed to hit a huge number of YouTube video pages...but only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says 200 million videos alone have been tagged with Content-ID (identified as belonging to movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
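
The rows quoted from Table 2a can be sanity-checked with a little arithmetic. The Python sketch below assumes the last column is each domain's share of all crawled documents (an assumption; the quoted rows carry no column headers), and `implied_total` is a name made up for illustration. Under that reading, count divided by fraction gives the implied size of the whole crawl.

```python
# (domain, count, fraction) copied from the first three rows quoted above.
ROWS = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
]

# Content-ID figure quoted in the comment above; a lower bound on video count.
YOUTUBE_CONTENT_ID_VIDEOS = 200_000_000


def implied_total(count: int, fraction: float) -> float:
    """Total documents implied by one row, assuming fraction = count / total."""
    return count / fraction


for domain, count, fraction in ROWS:
    total = implied_total(count, fraction)
    print(f"{domain}: ~{total / 1e9:.2f} billion documents implied")

youtube_share = ROWS[0][1] / YOUTUBE_CONTENT_ID_VIDEOS
print(f"youtube.com pages vs. Content-ID videos: {youtube_share:.0%}")
```

Each row implies a total of roughly 3.8 billion documents, so the fractions are at least internally consistent. The youtube.com count comes to a bit under half of the Content-ID figure alone, and not every youtube.com page is a video page, so the crawl does appear to cover only a fraction of the site.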

wicknicks 4695 days ago

Good crawlers should typically avoid Wikipedia links, to reduce the number of HTTP requests on the wiki servers (and keep their costs down), esp. because whole data dumps are available for download through a separate, cheaper channel: http://en.wikipedia.org/wiki/Wikipedia:Database_download

gojomo 4695 days ago

Yes and no.

Some crawlers are most interested in freshest versions of the most inlinked articles, or in the exact HTML presentation at Wikipedia.

The monthly full raw wikitext dumps don't provide that.

And, Wikipedia's serving plant is pretty efficient, with bandwidth only being a small portion of their costs. They can afford some crawling... and correspondingly, their /robots.txt is pretty open.

Good crawlers seeking just the bulk text shouldn't try to grab the whole thing as fast as possible via the standard web URLs... but other good crawlers may want or need to visit discovered Wikipedia links, and doing so at a measured pace should be OK.

greglindahl 4695 days ago

blekko attempted to implement crawling a local copy, and it was a PITA. We'd rather crawl the real thing with a crawl-delay of 1. Best would be if the Wikimedia Foundation made a .html dump available.
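
For readers unfamiliar with crawl-delay, Python's standard `urllib.robotparser` can read it from a robots.txt file. The sketch below is illustrative only: the robots.txt contents, the `examplebot` agent name and the `example.test` URLs are made up, and it is not a description of how blekko's crawler works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 1
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

AGENT = "examplebot"  # hypothetical user-agent name

# Seconds to wait between requests for this agent, or None if unspecified.
delay = parser.crawl_delay(AGENT)

print(delay)
print(parser.can_fetch(AGENT, "http://example.test/wiki/Page"))
print(parser.can_fetch(AGENT, "http://example.test/private/x"))
```

A crawler honoring this would sleep for `delay` seconds between requests to the same host and skip any URL for which `can_fetch` returns False.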

jjwiseman 4695 days ago

There are at least 2.5M English Wikipedia pages indexed in the crawl:

  $ cci_lookup org.wikipedia.en | wc -l
  2516956

(See https://github.com/wiseman/common_crawl_index, but note that the index is incomplete.)
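
The lookup key in the command above is the hostname with its dot-separated labels reversed (en.wikipedia.org becomes org.wikipedia.en). Assuming that is the general convention for keys in this index, which the example suggests but which should be checked against the project's README, a key can be built like this in Python. `reversed_host_key` is a hypothetical helper name, not part of the tool.

```python
def reversed_host_key(host: str) -> str:
    """Reverse the dot-separated labels of a hostname, lowercased."""
    return ".".join(reversed(host.lower().split(".")))


print(reversed_host_key("en.wikipedia.org"))
```

Reversing the labels puts all hosts under the same domain next to each other in sorted order, which is what makes a prefix lookup like the one above possible.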

spimmy 4695 days ago

What do you mean by "open"? Can the data be used for startups and other commercial purposes?

Aloisius 4695 days ago

Yes! Startups/commercial companies/etc can all use the data for free. The terms of use basically say, don't do anything illegal with it and a few other things, but it shouldn't affect the vast majority of uses.

Actually, tomorrow a video on a startup that uses Common Crawl data is getting posted.

CrazedGeek 4695 days ago

From the FAQ: "Please refer to the Common Crawl Terms of Use document for a detailed, authoritative description of our Terms of Use guidelines, but, in general, you cannot republish the data retrieved from the crawl (unless allowed by fair use), you cannot resell access to the service, you cannot use the crawl data for any illegal purposes, and you must respect the Terms of Use of the sites we crawl."

http://commoncrawl.org/about/terms-of-use/

res0nat0r 4695 days ago

The data is freely available: http://aws.amazon.com/datasets/41740

and you just need to comply with the Common Crawl TOU: http://commoncrawl.org/about/terms-of-use/

natch 4695 days ago

How does one get set up to access the s3:// links their blog posts reference? I do realize these point to Amazon S3 buckets, but how to get at them?

WestCoastJustin 4695 days ago

Just replace 's3://' with 'https://s3.amazonaws.com/'. You can use this link [1], but it looks like most of them are returning "Access Denied", so you would likely need to log in with your AWS username/password to access them.

[1] https://s3.amazonaws.com/aws-publicdatasets/
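
The prefix swap described above is a one-line string operation. Here is a small Python sketch of it; `s3_to_https` is a name made up for illustration, and the rewrite only changes the URL, so it does nothing about the "Access Denied" responses.

```python
S3_PREFIX = "s3://"
HTTPS_PREFIX = "https://s3.amazonaws.com/"


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://bucket/key as https://s3.amazonaws.com/bucket/key."""
    if not s3_url.startswith(S3_PREFIX):
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    return HTTPS_PREFIX + s3_url[len(S3_PREFIX):]


print(s3_to_https("s3://aws-publicdatasets/"))
```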

Aloisius 4695 days ago

You need an Amazon account - though the data is available for free, I think you need to specify your access key to actually fetch it.

From there you can grab the S3 command line tools (http://s3tools.org/s3cmd) or load it up from Hadoop or through one of the various open source libraries (boto, for instance).
