
There is a lot of published research on the design of traditional web crawlers.

Basically, you use a link-rank algorithm to build out your crawl frontier, and then an incremental ranking algorithm re-ranks the frontier.
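To make that concrete, here's a rough sketch of a frontier as a priority queue with incremental re-ranking. This is not our actual implementation; the `CrawlFrontier` class, its method names, and the idea of bumping a URL's score by a delta are all assumptions for illustration.

```python
import heapq
import itertools


class CrawlFrontier:
    """Hypothetical crawl frontier: highest-scored URL is crawled first."""

    def __init__(self):
        self._heap = []      # entries: [negated score, sequence, url, valid flag]
        self._entries = {}   # url -> its current live heap entry
        self._counter = itertools.count()  # tie-breaker so URLs are never compared

    def add(self, url, score):
        """Insert a URL, or replace its score if it is already queued."""
        old = self._entries.get(url)
        if old is not None:
            old[3] = False   # lazy deletion: mark the stale entry invalid
        entry = [-score, next(self._counter), url, True]
        self._entries[url] = entry
        heapq.heappush(self._heap, entry)

    def rerank(self, url, delta):
        """Incrementally adjust a URL's score (e.g. a new inbound link appeared)."""
        current = self._entries.get(url)
        base = -current[0] if current is not None else 0.0
        self.add(url, base + delta)

    def pop(self):
        """Return (url, score) for the best live entry, or None if empty."""
        while self._heap:
            neg_score, _, url, valid = heapq.heappop(self._heap)
            if valid:
                del self._entries[url]
                return url, -neg_score
        return None


frontier = CrawlFrontier()
frontier.add("http://example.com/a", 1.0)
frontier.add("http://example.com/b", 2.0)
frontier.rerank("http://example.com/a", 5.0)  # "a" now outranks "b"
```

The lazy-deletion trick avoids rebuilding the heap on every re-rank, which matters when scores change often.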

The problem is that latency is a major factor.

Many of the top social news posts start with some random user who witnesses something very interesting, and then the post shoots up rapidly.

So our goal basically has to be to index anything that's not spam and has the potential to become massive.

Additionally, a lot of the old-school Google architecture applies. A lot of our infra is devoted to solving problems that would be insanely expensive to build out in the cloud.

We keep re-running the math: purchasing our infra on Amazon Web Services would cost something like 150-250k per month, but we're doing it for about 12-15k per month.
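For a sense of scale, here's the back-of-the-envelope ratio implied by those ranges. The variable names are made up for this sketch, and the inputs are just the rough figures quoted above.

```python
# Rough monthly cost ranges from the comment above (same units on both sides).
aws_low, aws_high = 150_000, 250_000
own_low, own_high = 12_000, 15_000

# Most conservative case: cheapest AWS estimate vs. our most expensive month.
min_ratio = aws_low / own_high
# Most favorable case: priciest AWS estimate vs. our cheapest month.
max_ratio = aws_high / own_low

print(f"AWS would cost roughly {min_ratio:.0f}x to {max_ratio:.0f}x more")
```

So even at the conservative end, that's about an order of magnitude difference.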

It's definitely fun to have access to this much content though.

Additionally, our customers are brilliant, and we get to work with the CTOs of some very cool companies, which is always fun!