by corford 4191 days ago
Or for scraping anything non-trivial, do it yourself with CasperJS, AWS/Linode/DO, RabbitMQ and a bit of monitoring/alerting from someone like Datadog. It will be cheaper and a lot more flexible.

Edit: realised the above sounds a bit harsh. Am sure Espion can fill a gap where clients need to scrape a limited amount of non-volatile data and don't have the time to set up and manage something on their own.


You will need to manage the pool of IPs, which is sometimes the hardest part because it needs ongoing maintenance. The rest of the setup is usually a one-time job.
An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
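
A rough sketch of the rotation logic described above, in case it helps. This only covers the bookkeeping on the scraper side; the `ProxyPool` class, its method names and the addresses are hypothetical, and actually bringing up the new squid VM is assumed to happen elsewhere.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool of proxy URLs that retires banned entries.

    Hypothetical helper for illustration; not part of squid or any library.
    """

    def __init__(self, proxies):
        self._active = deque(proxies)
        self._banned = set()

    def next_proxy(self):
        """Return the next active proxy in round-robin order."""
        if not self._active:
            raise RuntimeError("no active proxies; cycle in a fresh instance")
        proxy = self._active[0]
        self._active.rotate(-1)  # move the proxy just handed out to the back
        return proxy

    def mark_banned(self, proxy, replacement=None):
        """Retire a banned proxy and optionally cycle in a replacement."""
        if proxy in self._active:
            self._active.remove(proxy)
        self._banned.add(proxy)
        # Never re-add an address that has already been banned.
        if replacement is not None and replacement not in self._banned:
            self._active.append(replacement)

    def active(self):
        """Return the active proxies in their current rotation order."""
        return list(self._active)


if __name__ == "__main__":
    # Example addresses are made up; 3128 is squid's default port.
    pool = ProxyPool(["http://10.0.1.10:3128", "http://10.0.1.11:3128"])
    banned = pool.next_proxy()
    pool.mark_banned(banned, replacement="http://10.0.2.20:3128")
    print(pool.active())
```

The scrapers only ever ask the pool for the next proxy, so swapping an instance in or out doesn't touch the rest of the scraping code.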
I deal with this every day. Eventually we gave up on the proxy pool idea, and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy: for example, we've also had to build infrastructure that helps keep track of IPs and their history.

We've open-sourced part of it. https://github.com/cardforcoin/shale

For some definition of easy, I guess.