Hacker News
I just finished crawling 5.19B web pages, Ask Me Anything
19 points by dor_jack 3353 days ago
I WAS JUST RATE LIMITED BY HN, SO I'M GOING TO ANSWER YOUR QUESTIONS UNDER A NEW ACCOUNT: dor_jack_2
7 comments

If you're rate-limited, you can contact the mods via the Contact link in the footer.
What tools did you use? What had to be custom-written and why?
We tried a bunch of technologies like Nutch, Heritrix, Storm Crawler, ... and eventually settled on Mixnode. Since it's a 'cloud platform', we didn't really have to change anything.

As for processing the data we crawled, we are using ArchiveSpark (https://github.com/helgeho/ArchiveSpark).

Also, Mixnode defaults to Amazon S3 for storage, which was OK with us since we're using EC2 for processing the results.

How much did it cost?
It was an all-inclusive deal: 420 TB at $0.06 per GB = $25,804
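For anyone checking the math, here is a rough sketch of how that total works out. It assumes binary terabytes (1 TB = 1024 GB), which is the only conversion that reproduces the quoted figure; the function name and constants are made up for illustration.

```python
# Assumption: 1 TB = 1024 GB. With 1000 GB per TB the total would be $25,200.00.
GB_PER_TB = 1024
PRICE_CENTS_PER_GB = 6  # $0.06 per GB, kept in whole cents to avoid float drift


def crawl_cost_usd(terabytes: int, gb_per_tb: int = GB_PER_TB) -> float:
    """Return the storage-based cost in US dollars for a crawl of the given size."""
    total_cents = terabytes * gb_per_tb * PRICE_CENTS_PER_GB
    return total_cents / 100


if __name__ == "__main__":
    print(f"${crawl_cost_usd(420):,.2f}")
```

So the quoted $25,804 looks like this amount with the cents dropped.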
What did you do to avoid winding up in endless GET URL loops? How deep did you get per site, and how did you schedule follow-up requests?
Loop/spam prevention was done by Mixnode; I'm not sure how they do it.

The crawl does not follow a DFS or BFS pattern, so the number of pages per site varies greatly with each host's server capacity and anti-crawling configs.

There was a minimum of 10 seconds between follow-up requests to the same website unless robots.txt had a lower delay. Pretty polite...
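To make the politeness rule concrete, here is a minimal sketch of a per-host delay that reads Crawl-delay from robots.txt. This is not how Mixnode does it (that isn't known here); `effective_delay` and `HostScheduler` are hypothetical names, and honoring a Crawl-delay higher than 10 seconds is an assumption on top of what was described.

```python
import urllib.robotparser

DEFAULT_DELAY_S = 10.0  # default gap between requests to the same host


def effective_delay(robots_txt: str, user_agent: str = "*") -> float:
    """Return the per-host delay in seconds implied by a robots.txt body."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.crawl_delay(user_agent)
    if declared is None:
        # No Crawl-delay directive: fall back to the 10 second default.
        return DEFAULT_DELAY_S
    # Assumption: any declared value is used, whether lower or higher than 10.
    return float(declared)


class HostScheduler:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self) -> None:
        self._next_allowed: dict[str, float] = {}

    def next_fetch_time(self, host: str, now: float, delay: float) -> float:
        """Reserve and return the next fetch slot for a host."""
        earliest = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = earliest + delay
        return earliest
```

A crawler loop would call `next_fetch_time` before each request and sleep until the returned timestamp, so one slow or strict host never blocks fetches to other hosts.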

Why didn't you use Common Crawl instead?
For our purposes, Common Crawl's corpus was missing too many websites (possibly due to the robots.txt configs of those websites). Also, we needed some deep coverage which CC could not provide.
What did you discover?
We are processing the data as we speak. However, how much technology use varies depending on where a company is based is truly incredible.

Will update this in a few days with more data.

It will be interesting to see how different frameworks, tech, or even design elements correlate with geographical location.
If our company approves, I would like to publish some general statistics that may be of interest to others.
How did you crawl so many sites? How did you discover them: search engine, IP ranges, or another method?
The platform we used provided its own seed list and took it from there.
How long did it take? What type of data did you record?
It took us about 13 days. We recorded resources of all types: text/*, image/*, application/*

As one would expect, the vast majority of data recorded is text/* (html,...)

But why?
Our company is in the Marketing Intelligence (MI) industry. We needed to measure the penetration of multiple technologies in different countries.