
I've been working on something similar and have run into some of the issues you mention. As you correctly pointed out, quality and post-processing also matter if you want to avoid crawling irrelevant/spam sites, which can be HUGE! The work presented here is cool, but it does not address the whole picture. Having a crawler that takes quality and user feedback into account is the hard part. Not to mention being polite with the requests... we need to scale, but without ignoring robots.txt.
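To make the politeness point concrete, here is a minimal sketch of checking robots.txt before fetching, using Python's standard `urllib.robotparser`. The helper names (`build_parser`, `polite_fetch_plan`), the `example-bot` user agent, and the 1-second fallback delay are my own assumptions for illustration, not anything from the crawler being discussed.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, user_agent, urls, default_delay=1.0):
    """Return the URLs we may fetch and the delay to wait between requests.

    default_delay is an assumed fallback, used only when the site
    does not declare a Crawl-delay for this user agent.
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return allowed, delay


if __name__ == "__main__":
    # Hypothetical robots.txt for an example site.
    robots = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
    parser = build_parser(robots)
    allowed, delay = polite_fetch_plan(
        parser,
        "example-bot",
        ["https://example.com/public/a", "https://example.com/private/b"],
    )
    print(allowed, delay)
```

A real crawler would also cache the parsed rules per host and enforce the delay per host rather than globally, which is where most of the scheduling complexity comes from.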

So crawling a billion links in X hours is not trivial, but it's not that hard either, especially with cloud infrastructure like AWS. It's just a matter of a good enough implementation and how much money one wants to spend on it.
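As a rough back-of-the-envelope for the "how much money" part, the sketch below turns a link count and a time budget into a required fetch rate and a worker count. The 24-hour window and the 100 pages/second per worker are made-up illustrative inputs, not measurements from any real crawl.

```python
import math


def required_rate(total_urls: int, hours: float) -> float:
    """Average pages per second needed to finish within the time budget."""
    return total_urls / (hours * 3600)


def workers_needed(total_urls: int, hours: float, per_worker_rate: float) -> int:
    """Workers needed if each one sustains per_worker_rate pages/second."""
    return math.ceil(required_rate(total_urls, hours) / per_worker_rate)


if __name__ == "__main__":
    total = 1_000_000_000   # a billion links
    hours = 24              # assumed time budget
    per_worker = 100        # assumed sustained pages/second per worker
    print(round(required_rate(total, hours)), "pages/s overall")
    print(workers_needed(total, hours, per_worker), "workers")
```

The number that really drives the bill is the per-worker rate, and per-host politeness delays tend to pull it down unless the URL frontier is spread across many hosts.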