

by iamchrisle 4564 days ago

Moz is an analytics startup for marketers. It essentially crawls the internet to build a graph of links and reports on that graph in its user interface. They also calculate a score based on the authority of links, similar to but not the same as Google's PageRank. They also do social tracking.
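For anyone unfamiliar with link-authority scores, here is a minimal sketch of textbook PageRank over a tiny link graph. To be clear, this is not Moz's scoring method (which I don't know); the function name, the damping factor of 0.85, and the iteration count are all my own assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative only).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's authority evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outlinks spreads its authority over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

At the scale of tens of billions of URLs this obviously would not run on one machine in a Python dict; the point is only to show what "a score based on the authority of links" means.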

Technical details:

I don't know exactly how they do it, but my guess is that they are I/O-bound on the crawling side and CPU-bound on the parsing and processing side. I assume they use different machines for those tasks.

On the crawling side, the index with 60-70 billion URLs was using 80 cc2.8xlarge machines, with a backup on 200 c1.xlarge machines. (http://moz.com/blog/one-step-back-two-steps-forward)

(DISCLAIMER: I'm in the same industry. I am a developer working for an indirect competitor. I know a few engineers and non-engineers who work at Moz.)

1 comment

AznHisoka 4564 days ago

Rand replied that they don't use AWS for crawling.

I would imagine the main bottleneck, by far, is the I/O from reading and writing to the database clusters and search index. Crawling is relatively cheap; it's storing that data that's hard.
