
by greglindahl 3929 days ago

Academics who write crawlers that don't do much with the pages they fetch can do hundreds of millions of pages a day with an ordinary server and a big, fat network pipe. At that speed they aren't even parsing HTML; they're using regexes to try to find URLs, and that's about it.
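For illustration, regex-only link extraction might look something like the sketch below. The pattern and the `extract_urls` name are my own assumptions, not anything taken from a real crawler; it only catches absolute http(s) URLs in quoted `href` attributes and will miss plenty of links, which is exactly the trade-off being made for speed.

```python
import re

# Assumption: a deliberately crude pattern. It matches absolute http(s) URLs
# inside single- or double-quoted href attributes and ignores everything else
# (relative links, unquoted attributes, URLs built by scripts).
HREF_RE = re.compile(r'''href\s*=\s*["'](https?://[^"'\s>]+)["']''', re.IGNORECASE)


def extract_urls(page: str) -> list[str]:
    """Return absolute URLs found in href attributes, in order, without duplicates."""
    seen = set()
    urls = []
    for match in HREF_RE.finditer(page):
        url = match.group(1)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

No DOM gets built and no tags get balanced, so the cost per page is roughly one linear scan of the text.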

At blekko, we did ~100k pages/day/server with our production crawler, running on a cluster that was also doing anti-web-spam, inverting outgoing links into incoming links, indexing everything, and running analytics batch jobs to support development.

So unless you're doing a LOT of work on every webpage, you're kinda slow.

The easiest mistake to make is not being async enough. This Python example is great.
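To make "async enough" concrete, here is a minimal sketch using the standard library's asyncio. Everything in it is an assumption for illustration: `fake_fetch` stands in for a real HTTP request by sleeping, and `crawl`, `max_in_flight`, and the example.com URLs are hypothetical names, not blekko's code or the parent's example. The point is only that many fetches are in flight at once instead of one after another.

```python
import asyncio


async def fake_fetch(url: str, delay: float = 0.05) -> str:
    # Assumption: stand-in for a network request; a real crawler would
    # call an async HTTP client here instead of sleeping.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls, max_in_flight=100, fetch=fake_fetch):
    """Fetch all urls concurrently, with at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def fetch_one(url):
        # The semaphore caps concurrency so the crawler doesn't open an
        # unbounded number of connections.
        async with semaphore:
            return url, await fetch(url)

    # gather() schedules every fetch at once; the waits overlap.
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(results)


def run_demo(n=200):
    # Hypothetical URLs, used only to drive the sketch.
    urls = [f"http://example.com/{i}" for i in range(n)]
    return asyncio.run(crawl(urls))
```

With a sequential loop, total time grows with the sum of the delays; here it is closer to the sum divided by `max_in_flight`, which is why a low concurrency limit is such an easy way to end up slow.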