by gondo 4191 days ago
You will need to manage the pool of IPs, which is sometimes the hardest part because it has to be maintained. The rest of the setup is usually a one-time job.

An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
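
To make the "cycle in another instance" idea concrete, here is a minimal sketch of the scraper-side bookkeeping. The `ProxyPool` class, the `opener_for` helper, and the proxy addresses are all hypothetical; it assumes each squid instance is reachable as a plain HTTP proxy URL, and it leaves out actually provisioning the new VM.

```python
import urllib.request
from collections import deque


class ProxyPool:
    """Round-robin pool of proxy URLs (hypothetical helper, not a real library)."""

    def __init__(self, proxies):
        self._active = deque(proxies)

    def next_proxy(self):
        # Hand out the proxy at the front, then rotate so the next call gets another.
        proxy = self._active[0]
        self._active.rotate(-1)
        return proxy

    def replace_banned(self, banned, fresh):
        # Drop the banned endpoint and cycle in a fresh one; scrapers are untouched.
        self._active.remove(banned)
        self._active.append(fresh)


def opener_for(proxy):
    # Build a urllib opener that sends HTTP and HTTPS traffic through the proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


# Example addresses are placeholders for squid instances on the proxy network.
pool = ProxyPool(["http://10.0.0.1:3128", "http://10.0.0.2:3128"])
first = pool.next_proxy()
pool.replace_banned(first, "http://10.0.0.3:3128")  # first one got banned
remaining = [pool.next_proxy() for _ in range(2)]
```

The scraper only ever asks the pool for the next proxy, so swapping an instance in or out is a one-line change on the pool side.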
I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers themselves in the pool, as Selenium nodes. It's definitely not easy: for example, we've also had to build infrastructure that helps keep track of IPs and their history.

We've open-sourced part of it. https://github.com/cardforcoin/shale
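
As a rough illustration of what "keeping track of IPs and their history" can mean, here is a hypothetical sketch. It is not taken from shale and does not reflect its API; the `IPRecord` name, the event fields, and the "banned" / "ok" outcome labels are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IPRecord:
    """History of one outbound IP (hypothetical structure for illustration)."""

    address: str
    events: list = field(default_factory=list)  # (timestamp, site, outcome) tuples

    def record(self, site, outcome, when=None):
        # Append an event; default the timestamp to the current UTC time.
        when = when or datetime.now(timezone.utc)
        self.events.append((when, site, outcome))

    def banned_on(self, site):
        # Treat the IP as banned on a site if its latest event there is a ban.
        for _, seen_site, outcome in reversed(self.events):
            if seen_site == site:
                return outcome == "banned"
        return False


rec = IPRecord("203.0.113.7")  # documentation-range address, placeholder only
rec.record("example.com", "ok")
rec.record("example.com", "banned")
```

With records like this, a scheduler could skip any node whose IP is currently banned on the target site and still use it for other sites.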

For some definition of easy, I guess.