Hacker News
by robbs 1199 days ago
IMO, this is the hardest part of maintaining a web scraper. We had ~100 scripts scraping ~1000 clients' sites, and keeping up with site changes took at least 50 hours a week.

The second hardest part was that 30% of our clients used the same hosting provider, which would start to fail at 10-20 req/s. We had to throttle requests to those sites by IP, cluster-wide.
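
For anyone wondering what throttling by IP can look like, here is a minimal single-process sketch. It is not the code we ran: the class name `PerIPThrottle`, the `max_rate` parameter, and the 5 req/s figure are all assumptions made for illustration. The idea is to key the limiter on the resolved IP rather than the hostname, so that many sites on one host share one budget.

```python
import threading
import time
from typing import Callable, Dict


class PerIPThrottle:
    """Hypothetical helper: space out requests that go to the same IP address."""

    def __init__(self, max_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # max_rate is requests per second allowed per IP (an assumed setting).
        self.min_interval = 1.0 / max_rate
        self._clock = clock
        self._next_slot: Dict[str, float] = {}  # IP -> earliest next send time
        self._lock = threading.Lock()

    def reserve(self, ip: str) -> float:
        """Reserve the next send slot for `ip` and return the seconds to wait."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot.get(ip, now))
            self._next_slot[ip] = slot + self.min_interval
            return slot - now

    def wait(self, ip: str) -> None:
        """Block until this caller's reserved slot for `ip` arrives."""
        delay = self.reserve(ip)
        if delay > 0:
            time.sleep(delay)
```

A worker would call something like `throttle.wait(socket.gethostbyname(host))` before each request. This only coordinates threads inside one process; to make it cluster-wide, the per-IP slot table would have to live in a store that every worker shares, which this sketch does not attempt.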

1 comment

This makes sense and I am curious about this. Was there consistency between those 1k client sites or were they all rather different? Mind if I reach out?