by fsg7sdfg789
3929 days ago
It's not a significant number of pages per day, honestly. For me, the limiting factor is almost always how many concurrent requests I feel comfortable making to the remote server. For big sites, the proxy I use generally caps it at 5 concurrent requests per domain. I generally use distributed crawlers, which means I can scale to millions of pages per day (assuming they're spread across different domains). At that scale, the biggest limiting factor is the database layer: how many writes I can do in a day. If I need to go faster, I just spin up another crawler worker, which connects to the queue and starts pulling jobs. I believe anything under a million pages per day should be doable with a homebuilt, single-server system.
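
For anyone who wants to see the shape of this, here is a minimal sketch of workers pulling jobs from a shared queue with a per-domain concurrency cap. It assumes Python's asyncio and a single process; the fetch is simulated (no network), and every name here (`crawl`, `fake_fetch`, `PER_DOMAIN_LIMIT`) is hypothetical rather than the setup described above. A real distributed crawler would use an external queue and a real HTTP client.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit

PER_DOMAIN_LIMIT = 5  # assumed cap, mirroring the "5 per domain" figure above


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical; does no network I/O)."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, num_workers=20, fetch=fake_fetch):
    """Fetch all urls with num_workers workers, capping concurrency per domain."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    # One semaphore per domain, created lazily the first time a domain is seen.
    limits = defaultdict(lambda: asyncio.Semaphore(PER_DOMAIN_LIMIT))
    in_flight = defaultdict(int)  # requests currently running, per domain
    peak = defaultdict(int)       # highest concurrency observed, per domain
    results = {}

    async def worker():
        # Each worker pulls jobs until the queue is empty, then exits.
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            domain = urlsplit(url).netloc
            async with limits[domain]:  # wait here if the domain is at its cap
                in_flight[domain] += 1
                peak[domain] = max(peak[domain], in_flight[domain])
                try:
                    results[url] = await fetch(url)
                finally:
                    in_flight[domain] -= 1

    # "Spinning up another worker" is just a larger num_workers in this sketch.
    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return results, dict(peak)


if __name__ == "__main__":
    urls = [f"https://example.com/p{i}" for i in range(30)]
    urls += [f"https://example.org/p{i}" for i in range(10)]
    results, peak = asyncio.run(crawl(urls))
    print(len(results), peak)
```

For a sense of scale, one million pages per day averages out to roughly 11.6 pages per second (1,000,000 / 86,400), so with a cap of 5 per domain the throughput has to come from crawling many domains in parallel, not from hitting any one domain harder.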