by reinhardt 3933 days ago
It doesn't make much sense to give a number for speed without some specifics about the crawler environment, such as:

  - How many servers (if distributed)?
  - How many cores/server?
  - What kind of processing takes place for each page?
    Does it just download and save the pages somewhere (local filesystem, cloud storage, database), or does it extract (semi-)structured data? And so on.
Specifics aside, these days it's not hard to crawl millions of pages/day on commodity servers. Some related posts:

http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-bil...

http://blog.semantics3.com/how-we-built-our-almost-distribut...

http://engineering.bloomreach.com/crawling-billions-of-pages...

http://engineering.bloomreach.com/crawling-billions-of-pages...
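
To put "millions of pages/day" in perspective, here is a rough back-of-envelope sketch that converts a daily target into the average fetch rate each server would need to sustain. The function name and the `servers` parameter are made up for illustration, and it assumes load is spread evenly across servers and across the day, which a real crawl (politeness delays, retries, slow hosts) won't match exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_second(pages_per_day: float, servers: int = 1) -> float:
    """Average fetch rate each server must sustain to hit a daily target.

    Hypothetical helper: assumes an even split across servers and time.
    """
    return pages_per_day / SECONDS_PER_DAY / servers


if __name__ == "__main__":
    for target in (1_000_000, 10_000_000):
        rate = pages_per_second(target)
        print(f"{target:,} pages/day is about {rate:.1f} pages/s on one server")
```

One million pages/day works out to roughly 11.6 pages/s on average, which is why the per-page processing and the number of servers and cores matter more than the headline figure.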

1 comment

Thank you very much!