by corford 4190 days ago
An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
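A minimal sketch of the cycle-in-a-fresh-proxy idea described above, not anyone's actual setup. The `ProxyPool` class, the `provision` callback (standing in for whatever boots a new squid VM and returns its URL), and the example addresses are all hypothetical names chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Set


class ProxyPool:
    """Round-robin pool of proxy URLs with ban-and-replace (illustrative only)."""

    def __init__(self, proxies: Iterable[str], provision: Callable[[], str]):
        self._active: Deque[str] = deque(proxies)
        self._banned: Set[str] = set()
        # Hypothetical hook: would start a new squid instance on a fresh IP
        # and return its URL. Here it is just any zero-argument callable.
        self._provision = provision

    def next_proxy(self) -> str:
        # Hand out the proxy at the front, then rotate so the next call
        # returns a different one.
        proxy = self._active[0]
        self._active.rotate(-1)
        return proxy

    def mark_banned(self, proxy: str) -> str:
        # Drop the banned proxy and cycle in a freshly provisioned one.
        # The scraper side only ever calls next_proxy(), so it is untouched.
        self._active.remove(proxy)
        self._banned.add(proxy)
        fresh = self._provision()
        self._active.append(fresh)
        return fresh

    def active(self) -> List[str]:
        return list(self._active)
```

The scrapers would ask the pool for a proxy per request and report bans back to it; swapping in a new squid instance never touches the scraping code itself.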
2 comments

I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy: for example, we've also had to build infrastructure that helps keep track of IPs and their history.

We've open-sourced part of it. https://github.com/cardforcoin/shale
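For a rough idea of what keeping track of IPs and their history could involve, here is a small sketch. It is not taken from shale; the `IPHistory` class, the `Event` record, the "ok"/"banned" outcomes, and the cooldown rule are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    timestamp: float  # seconds, from whatever clock the caller uses
    site: str
    outcome: str      # assumed vocabulary: "ok" or "banned"


class IPHistory:
    """Per-IP event log with a simple per-site cooldown after a ban."""

    def __init__(self, cooldown_seconds: float = 3600.0):
        self._events: Dict[str, List[Event]] = defaultdict(list)
        self._cooldown = cooldown_seconds

    def record(self, ip: str, site: str, outcome: str, now: float) -> None:
        self._events[ip].append(Event(now, site, outcome))

    def usable(self, ip: str, site: str, now: float) -> bool:
        # Find the most recent ban of this IP on this site; the IP is
        # usable again only once the cooldown has elapsed since then.
        for event in reversed(self._events.get(ip, [])):
            if event.site == site and event.outcome == "banned":
                return now - event.timestamp >= self._cooldown
        return True  # never banned on this site
```

A scheduler could consult `usable()` before assigning a browser node on a given IP to a given site, so an IP banned in one place can still be used elsewhere.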

For some definition of easy, I guess.