Hacker News
by worried_citizen 3822 days ago
Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that accounts for most of the cost and engineering effort.

As your index and scale grow, you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers?

2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.
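
On the DNS question (1), one common way to cut the lookup rate is to cache resolutions inside the crawler, so a hostname is resolved once per TTL window rather than once per URL. A minimal sketch follows. `DnsCache` and its parameters are hypothetical names made up for illustration, and the fixed TTL is an assumption: it ignores the TTL the DNS record itself advertises, which a production resolver would respect.

```python
import socket
import time


class DnsCache:
    """Tiny in-process DNS cache with a fixed TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds=300.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.resolver = resolver  # injectable so it can be swapped or faked
        self.clock = clock
        self._entries = {}        # hostname -> (ip, expiry time)

    def resolve(self, hostname):
        now = self.clock()
        hit = self._entries.get(hostname)
        if hit is not None and hit[1] > now:
            return hit[0]  # still fresh: no upstream request
        ip = self.resolver(hostname)  # miss or expired: ask upstream
        self._entries[hostname] = (ip, now + self.ttl)
        return ip
```

With this in front of the fetcher, a crawl that pulls thousands of URLs from the same host sends one upstream query per TTL window for that host. It does nothing for the long tail of hosts seen only once, which is where the load the question describes really comes from.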

1 comment

I've been working on something similar and have run into some of the issues you mention. As you correctly pointed out, quality and post-processing also matter if you want to avoid crawling irrelevant or spam sites, which can be HUGE! The work presented here is cool, but it does not address the whole picture. Having a crawler that takes quality and user feedback into account is the hard part. Not to mention being polite with the requests... we need to scale, but without ignoring robots.txt.
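
To make the politeness point concrete, here is a rough sketch of a per-host gate that checks robots.txt rules and spaces out requests. It uses Python's standard `urllib.robotparser`. `PoliteGate`, `load_rules`, `wait_time`, and the one-second default delay are hypothetical choices for illustration, not anything from the work being discussed. It also assumes the robots.txt body has already been fetched, and treats hosts with no loaded rules as allowed, which a real crawler should not do before trying to fetch the file.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Per-host robots.txt check plus request spacing (hypothetical helper)."""

    def __init__(self, user_agent, default_delay=1.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.clock = clock
        self._rules = {}    # host -> RobotFileParser
        self._next_ok = {}  # host -> earliest time the next request may start

    def load_rules(self, host, robots_txt):
        """Parse an already-downloaded robots.txt body for one host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._rules[host] = parser

    def wait_time(self, url):
        """Return seconds to wait before fetching url, or None if disallowed."""
        host = urlsplit(url).netloc
        parser = self._rules.get(host)
        delay = self.default_delay
        if parser is not None:
            if not parser.can_fetch(self.user_agent, url):
                return None
            declared = parser.crawl_delay(self.user_agent)
            if declared is not None:
                delay = float(declared)  # honour the site's Crawl-delay
        now = self.clock()
        start = max(now, self._next_ok.get(host, now))
        self._next_ok[host] = start + delay  # reserve the slot for this fetch
        return start - now
```

The scaling tension shows up right away: the gate limits throughput per host, so total crawl speed has to come from spreading work across many hosts at once, not from hitting any one of them harder.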

So crawling a billion links in X hours is not trivial, but it's not that hard either, especially with cloud infrastructure like AWS. It's just a matter of a good enough implementation and how much money one wants to spend on it.