by greglindahl
3929 days ago
Academics who write crawlers that don't do much with the pages they fetch can do hundreds of millions of pages a day with an ordinary server and a big, fat network pipe. At that speed they aren't even parsing HTML; they're using regexes to try to find URLs, and that's about it. At blekko, we did ~100k pages/day/server with our production crawler, running on a cluster that was also doing anti-web-spam, inverting outgoing links into incoming links, indexing everything, and running analytics batch jobs to support development. So unless you're doing a LOT of work on every webpage, you're kinda slow. The easiest mistake to make is not being async enough. This Python example is great.
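To make the "regexes for URLs" and "async enough" points concrete, here is a rough sketch using only the standard library. It is not blekko's crawler or anyone's production code. The names `extract_urls`, `crawl`, `fetch`, and `max_in_flight` are made up for illustration, and `fetch` is assumed to be whatever async function you already have that turns a URL into page text.

```python
import asyncio
import re

# Crude href matcher: no real HTML parsing, just a regex (assumption: we only
# care about absolute http/https links inside quoted href attributes).
HREF_RE = re.compile(r'href=["\'](https?://[^"\'\s>]+)', re.IGNORECASE)


def extract_urls(html):
    """Return absolute URLs found in href attributes, first-seen order, no duplicates."""
    return list(dict.fromkeys(HREF_RE.findall(html)))


async def crawl(urls, fetch, max_in_flight=100):
    """Fetch every URL concurrently, keeping at most max_in_flight requests open.

    `fetch` is a hypothetical async callable (url -> page text) supplied by the caller.
    Returns a dict mapping each input URL to the list of links found on it.
    """
    sem = asyncio.Semaphore(max_in_flight)

    async def one(url):
        # Only the network wait is bounded by the semaphore.
        async with sem:
            html = await fetch(url)
        return url, extract_urls(html)

    results = await asyncio.gather(*(one(u) for u in urls))
    return dict(results)


if __name__ == "__main__":
    async def fake_fetch(url):
        # Stand-in for a real HTTP request: sleep to simulate latency.
        await asyncio.sleep(0.01)
        return f'<a href="{url}/next">next</a>'

    seeds = [f"http://example.com/{i}" for i in range(1000)]
    found = asyncio.run(crawl(seeds, fake_fetch))
    print(len(found), "pages fetched")
```

The thing to notice is that the requests overlap: with a large `max_in_flight`, wall-clock time is dominated by the slowest responses rather than the sum of all of them. A real crawler would also need per-host politeness, timeouts, and error handling, none of which are shown here.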