I just finished crawling 5.19B web pages, Ask Me Anything

	19 points by dor_jack 3353 days ago
	I WAS JUST RATE LIMITED BY HN, SO I'M GOING TO ANSWER YOUR QUESTIONS UNDER A NEW ACCOUNT: dor_jack_2

7 comments

grzm 3353 days ago

If you're rate-limited, you can contact the mods via the Contact link in the footer.

dm_i386 3353 days ago

What tools did you use? What had to be custom-written and why?

dor_jack_2 3353 days ago

We tried a bunch of technologies like Nutch, Heritrix, and StormCrawler, and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.

As for processing the data we crawled, we are using ArchiveSpark (https://github.com/helgeho/ArchiveSpark)

Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 for processing the results.

maurtinshkreli 3353 days ago

How much did it cost?

dor_jack 3353 days ago

It was an all-inclusive deal: 420 TB at $0.06 per GB = $25,804
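
The arithmetic behind that figure can be checked with a short sketch. It assumes 1 TB = 1,024 GB, which the post does not state but which is the convention that reproduces the quoted total; the variable names are illustrative only.

```python
# Assumption: 1 TB = 1,024 GB (the post does not say which convention was used).
GB_PER_TB = 1024

stored_tb = 420        # data volume quoted in the post
price_per_gb = 0.06    # USD per GB, as quoted in the post

total_usd = stored_tb * GB_PER_TB * price_per_gb
print(f"${total_usd:,.2f}")
```

With a decimal convention (1 TB = 1,000 GB) the total would instead be $25,200, so the binary convention appears to be the one used.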

tlack 3353 days ago

What did you do to avoid winding up in endless GET URL loops? How deep did you get per site, and how did you schedule follow-up requests?

dor_jack_2 3353 days ago

Loop/spam prevention was done by Mixnode; I'm not sure how they do it.

The crawl does not follow a DFS or BFS pattern, so the number of pages per site varies greatly with each host's server capacity and anti-crawling configuration.

There was a minimum of 10 seconds between follow-up requests to the same website unless robots.txt had a lower delay. Pretty polite...
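
A minimal sketch of that delay rule as literally described, not Mixnode's actual implementation (which is unknown here). It assumes the per-site delay comes from a `Crawl-delay` directive in robots.txt; `DEFAULT_DELAY` and `delay_for` are hypothetical names, and the parsing uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 10  # seconds between requests to the same site, per the comment above


def delay_for(robots_txt: str, user_agent: str = "*") -> int:
    """Return the delay to use for a site, given the text of its robots.txt.

    Literal reading of the rule: use the robots.txt Crawl-delay only when it
    is lower than the default; otherwise wait the default 10 seconds.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawl_delay = parser.crawl_delay(user_agent)  # None if no directive applies
    if crawl_delay is not None and crawl_delay < DEFAULT_DELAY:
        return crawl_delay
    return DEFAULT_DELAY


print(delay_for("User-agent: *\nCrawl-delay: 5"))
print(delay_for("User-agent: *\nDisallow: /private"))
```

Whether a robots.txt delay longer than 10 seconds was also honored is not stated in the thread; this sketch simply falls back to the default in that case.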

joshpen188 3353 days ago

Why didn't you use Common Crawl instead?

dor_jack_2 3353 days ago

For our purposes, Common Crawl's corpus was missing too many websites (possibly due to those websites' robots.txt configs). We also needed some deep coverage, which CC could not provide.

savethefuture 3353 days ago

What did you discover?

dor_jack 3353 days ago

We are processing the data as we speak. However, the way technology choices shift depending on where a company is based is truly incredible.

Will update this in a few days with more data.

savethefuture 3353 days ago

That will be an interesting correlation to see: different frameworks, tech, or even design elements by geographical location.

dor_jack 3353 days ago

If our company approves, I would like to publish some general statistics that may be of interest to others.

savethefuture 3353 days ago

How did you crawl so many sites? How did you discover them: search engines, IP ranges, or another method?

dor_jack 3353 days ago

The platform we used provided its own seed list and took it from there.

savethefuture 3353 days ago

How long did it take? What type of data did you record?

dor_jack 3353 days ago

It took us about 13 days. We recorded resources of all types: text/*, image/*, application/*

As one would expect, the vast majority of the data recorded is text/* (HTML, ...)
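
For a sense of scale, the figures quoted in this thread (5.19B pages in about 13 days) imply an average fetch rate. A rough calculation, assuming the crawl ran continuously for exactly 13 days, which the thread only states approximately:

```python
# Figures quoted in the thread; "about 13 days" is treated as exactly 13 here.
pages = 5.19e9
days = 13

seconds = days * 24 * 60 * 60
pages_per_second = pages / seconds
print(f"{pages_per_second:,.0f} pages per second on average")
```

That works out to roughly 4,600 pages per second, averaged across all hosts.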

itburnslikeice 3353 days ago

but why?

dor_jack 3353 days ago

Our company is in the Marketing Intelligence (MI) industry. We needed to measure the penetration of multiple technologies in different countries.
