| HN Mirror



	by corford 4238 days ago
	Or, for scraping anything non-trivial, do it yourself with CasperJS, AWS/Linode/DO, RabbitMQ and a bit of monitoring/alerting from someone like Datadog. It will be cheaper and a lot more flexible. Edit: I realised the above sounds a bit harsh. I'm sure Espion can fill a gap where clients need to scrape a limited amount of non-volatile data and don't have the time to set up and manage something on their own.

1 comments

gondo 4238 days ago

You will need to manage the pool of IPs, which is sometimes the hardest part because it has to be maintained on an ongoing basis. The rest of the setup is usually a one-time job.


corford 4238 days ago

An easy solution to this is to have your scrapers on one network and a fleet of squid proxies on another. As and when an IP gets banned you just cycle in another squid instance on a new VM with a fresh IP (which can be from a completely different netblock/geolocation if you want). You don't have to touch the rest of the scraping apparatus.
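
To make the rotation idea concrete, here is a minimal Python sketch of the bookkeeping side only. It assumes each squid instance is reachable at a proxy URL (3128 is squid's default port) and that spare instances on fresh IPs are already provisioned; the `ProxyPool` class, its method names, and the addresses are hypothetical and just for illustration. Spinning up the actual VMs is out of scope here.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool of proxy URLs with spares to cycle in after a ban."""

    def __init__(self, active, spares):
        self.active = deque(active)   # proxies currently handed to scrapers
        self.spares = deque(spares)   # fresh instances waiting to be cycled in
        self.banned = []              # proxies taken out of rotation

    def next_proxy(self):
        """Return the next proxy in round-robin order."""
        if not self.active:
            raise RuntimeError("no active proxies left")
        proxy = self.active[0]
        self.active.rotate(-1)        # move the one just used to the back
        return proxy

    def report_ban(self, proxy):
        """Retire a banned proxy and cycle in a spare, if one is available.

        Returns the replacement proxy URL, or None if nothing was swapped in.
        """
        if proxy not in self.active:
            return None
        self.active.remove(proxy)
        self.banned.append(proxy)
        if self.spares:
            replacement = self.spares.popleft()
            self.active.append(replacement)
            return replacement
        return None


# Hypothetical addresses for illustration only.
pool = ProxyPool(
    active=["http://10.0.0.1:3128", "http://10.0.0.2:3128"],
    spares=["http://10.0.1.1:3128"],
)
first = pool.next_proxy()
replacement = pool.report_ban(first)
```

The scrapers only ever ask the pool for a proxy URL, so swapping an instance in or out does not touch the rest of the scraping apparatus.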


mhluongo 4238 days ago

I deal with this every day. Eventually we gave up on the proxy pool idea and started running the headless browsers in the pool as Selenium nodes. It's definitely not easy; for example, we've also had to build infrastructure that helps keep track of IPs and their history.

We've open-sourced part of it. https://github.com/cardforcoin/shale
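
The following is not code from shale. It is only a rough Python sketch of what "keeping track of IPs and their history" could look like at its simplest: an in-memory log of outcomes per (IP, site) pair, plus a check that keeps an IP out of use for a cooldown period after its most recent ban. The `IPHistory` class, the outcome labels, and the cooldown value are all assumptions made for illustration; a real system would presumably persist this somewhere shared.

```python
import time
from collections import defaultdict


class IPHistory:
    """In-memory record of what happened to each IP on each target site."""

    def __init__(self, cooldown_seconds=3600.0):
        self.cooldown_seconds = cooldown_seconds
        # (ip, site) -> list of (timestamp, outcome), appended in time order
        self.events = defaultdict(list)

    def record(self, ip, site, outcome, now=None):
        """Log an outcome such as "ok" or "banned" for this IP on this site."""
        ts = time.time() if now is None else now
        self.events[(ip, site)].append((ts, outcome))

    def is_usable(self, ip, site, now=None):
        """An IP is usable unless its latest ban on this site is too recent."""
        ts = time.time() if now is None else now
        for when, outcome in reversed(self.events.get((ip, site), [])):
            if outcome == "banned":
                return ts - when >= self.cooldown_seconds
        return True  # never banned on this site


# Hypothetical IP and site; explicit timestamps keep the example deterministic.
history = IPHistory(cooldown_seconds=60)
history.record("203.0.113.5", "example.com", "ok", now=0)
history.record("203.0.113.5", "example.com", "banned", now=10)
still_cooling = history.is_usable("203.0.113.5", "example.com", now=30)
cooled_down = history.is_usable("203.0.113.5", "example.com", now=70)
```

Keying the history by (IP, site) rather than by IP alone reflects the assumption that a ban on one site says nothing about another.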


eli 4238 days ago

For some definition of easy, I guess.
