by psynapse
4163 days ago
I lead a team that works on several hundred bots scraping at high frequency. We also separate the problems of site structure and payload parsing, though slightly differently. We have a low-frequency discovery process that crawls the site to create a representative metadata structure. A high-frequency process then reads this structure to build the list of URLs to fetch and parse on each run. The behaviour can be modified, and/or the work divided between processes, by command-line arguments that filter the metadata.

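As a rough illustration of that split, here is a minimal Python sketch of the high-frequency side. It is not our actual code: the `PageMeta` schema, the `--section`, `--shard` and `--num-shards` flags, and the function names are all made up for the example, and the metadata is assumed to have already been written by the discovery pass.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class PageMeta:
    """One entry saved by the low-frequency discovery pass (hypothetical schema)."""
    section: str
    url: str


def build_url_list(metadata, sections=None, shard=0, num_shards=1):
    """Filter the discovered metadata down to the URLs this process should fetch."""
    # Optional behaviour filter: keep only the requested sections.
    selected = [m for m in metadata if sections is None or m.section in sections]
    # Divide work between processes: each one keeps every num_shards-th entry.
    return [m.url for i, m in enumerate(selected) if i % num_shards == shard]


def parse_args(argv=None):
    """Parse the (hypothetical) command-line flags that drive the filtering."""
    parser = argparse.ArgumentParser(description="High-frequency fetch pass")
    parser.add_argument("--section", action="append",
                        help="only fetch this section (repeatable)")
    parser.add_argument("--shard", type=int, default=0)
    parser.add_argument("--num-shards", type=int, default=1)
    return parser.parse_args(argv)


def main(argv, metadata):
    """Return the URLs selected by the given arguments."""
    args = parse_args(argv)
    return build_url_list(metadata, args.section, args.shard, args.num_shards)
```

Two processes started with `--num-shards 2` and `--shard 0` / `--shard 1` would then cover the same metadata between them without overlapping.
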
Incidentally, at scale I find that the hardest part is the orchestration as a whole: scheduling crawls, making sure resources are used efficiently without overloading the target sites, and properly detecting errors.
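
To give a flavour of the "without overloading the target sites" piece, here is a small Python sketch of a per-host politeness queue. It is only one possible approach, not a description of our system: the `PoliteScheduler` class and its `min_delay` parameter are invented for the example, the caller passes in the current time rather than the class reading a clock, and error detection and retries are left out entirely.

```python
import heapq
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hands out URLs so that each host gets at most one request per min_delay seconds."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self._queue = []      # heap of (ready_time, sequence, url)
        self._next_free = {}  # host -> earliest time the next request may go out
        self._seq = 0         # tie-breaker so the heap never has to compare URLs

    def add(self, url, now):
        """Queue a URL, reserving the next free time slot for its host."""
        host = urlsplit(url).netloc
        ready = max(now, self._next_free.get(host, now))
        self._next_free[host] = ready + self.min_delay
        heapq.heappush(self._queue, (ready, self._seq, url))
        self._seq += 1

    def pop_ready(self, now):
        """Return the next URL whose slot has arrived, or None if everything must wait."""
        if self._queue and self._queue[0][0] <= now:
            return heapq.heappop(self._queue)[2]
        return None
```

A worker loop would call `pop_ready(time.monotonic())` and sleep briefly whenever it gets `None` back, so spare capacity goes to other hosts instead of hammering one site.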