Hacker News
by troels 3930 days ago
I have a crawler setup that pulls a few million pages per day. The main constraint is not the crawler setup itself, but how much load the target sites can withstand. If I don't throttle the traffic, the sites get DoS'ed very quickly. Of course, this is mainly a problem because I crawl a lot of pages from each site. If you have a crawler that fetches only a few pages from each of a lot of sites, you'd be in a different scenario.
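To make the throttling idea concrete, here is a minimal sketch of a per-host rate limiter. It is not the actual setup described above, just one simple way to do it: the class name `PerHostThrottle`, the `min_delay_s` parameter, and the fixed-delay policy are all assumptions made for illustration. A real crawler would probably also honor `robots.txt` and back off when a site slows down or returns errors.

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Hypothetical helper: enforce a minimum delay between hits to one host."""

    def __init__(self, min_delay_s=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_s = min_delay_s   # assumed politeness interval per host
        self._clock = clock              # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = {}          # host -> earliest time of next request

    def wait(self, url):
        """Block until the URL's host may be hit again; return seconds waited."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # The request goes out at max(now, ready_at); book the next slot after it.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay_s
        return delay
```

You would call `throttle.wait(url)` right before each fetch. Because the delay is tracked per host, a crawl spread thinly over many sites barely waits at all, while a deep crawl of a single site is capped at roughly one request per `min_delay_s`.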