Hacker News
by fsg7sdfg789 3929 days ago
Honestly, it's not a significant number of pages per day. For me, the limiting factor is almost always how many concurrent requests I feel comfortable making to the remote server. For big sites, the proxy I use generally caps it at 5 concurrent requests per domain.
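A per-domain cap like that can be sketched with one semaphore per domain. This is only a rough illustration using asyncio, not the proxy mentioned above: `DomainLimiter`, `fake_fetch`, and `crawl` are made-up names, and the fetch is a stand-in that just sleeps.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

MAX_PER_DOMAIN = 5  # the cap quoted in the comment above


class DomainLimiter:
    """Hypothetical helper: at most `limit` requests in flight per domain."""

    def __init__(self, limit=MAX_PER_DOMAIN):
        self._limit = limit
        self._sems = {}
        self.in_flight = defaultdict(int)
        self.peak = defaultdict(int)  # highest concurrency seen, per domain

    def _sem(self, domain):
        # Create the semaphore lazily, the first time a domain is seen.
        if domain not in self._sems:
            self._sems[domain] = asyncio.Semaphore(self._limit)
        return self._sems[domain]

    async def run(self, url, fetch):
        domain = urlsplit(url).netloc
        async with self._sem(domain):
            self.in_flight[domain] += 1
            self.peak[domain] = max(self.peak[domain], self.in_flight[domain])
            try:
                return await fetch(url)
            finally:
                self.in_flight[domain] -= 1


async def fake_fetch(url):
    # Stand-in for a real HTTP request (assumption: replace with your client).
    await asyncio.sleep(0.01)
    return url


async def crawl(urls, limiter):
    return await asyncio.gather(*(limiter.run(u, fake_fetch) for u in urls))
```

Requests to different domains do not block each other here, which is why throughput scales with the number of distinct domains rather than with total pages.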

I generally use distributed crawlers, which means I can scale to millions of pages per day (assuming different domains). The biggest limiting factor is the database layer: how many writes I can do in a day.

If I need to go faster, I just spin up another crawler worker, which connects to the queue and starts pulling jobs.
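The worker-pulls-from-queue pattern can be sketched in a single process like this. It is a simplification: a real distributed setup would use a shared queue service rather than `queue.Queue`, and `worker`, `run_workers`, and `handle` are hypothetical names.

```python
import queue
import threading


def worker(jobs, results, handle):
    """Pull URLs off the queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        try:
            results.append(handle(url))
        finally:
            jobs.task_done()


def run_workers(urls, handle, n_workers):
    jobs = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(jobs, results, handle))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for u in urls:
        jobs.put(u)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Going faster then means raising `n_workers`; the producers and the job format do not change.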

I believe anything under a million pages / day should be doable with a homebuilt, single-server system.


Thanks. Well, this does make me wonder whether we are doing something wrong or performing actions that slow down the crawling. We have a good server (I believe).
You might want to benchmark where your software is spending its time. A typical overhead is connection time. You may be able to speed things up with a local DNS cache and by using HTTP keep-alive. You also generally want to do a lot of parallel requests, since most of the time is spent waiting for the target site to respond.
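As a rough starting point for that kind of benchmarking, something like the sketch below accumulates wall-clock time per phase so you can see which one dominates. `PhaseTimer` and `resolve` are made-up names, and the in-process DNS cache ignores record TTLs, so treat it as an illustration rather than a replacement for a proper local caching resolver.

```python
import socket
import time
from collections import defaultdict
from contextlib import contextmanager
from functools import lru_cache


class PhaseTimer:
    """Hypothetical helper: total wall-clock time spent in each named phase."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Slowest phase first.
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


@lru_cache(maxsize=4096)
def resolve(host, port):
    # Naive in-process DNS cache: repeated lookups skip the resolver.
    # Assumption: stale entries are acceptable for the length of a crawl.
    return socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
```

Wrap each step of a fetch (for example `with timer.phase("dns"): resolve(host, 80)`, then connect, download, parse, and database write) and look at `timer.report()` after a few hundred pages. If connect time dominates, reusing connections with keep-alive is the likely win.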
Don't forget to check your connection too; maybe you are filling it up or have latency issues.
I just reached out to you via email to see if I can help.