Large-scale crawling is primarily a challenge of balancing the logistics in a way that is kind to both the crawler and the data consumers.
Distributed crawling, if you go that way, is also non-trivial, as you're effectively juggling shared, rapidly mutating state in the dozens of gigabytes.