by rosha 2896 days ago
I tried several different queue systems; the best version I got used Erlang Queue, Elixir, and Kafka on top to build a highly concurrent crawler. The project was to develop a real-time Amazon product ASIN price monitoring system for our company as a challenger prototype. Our main problem was proxies: we stopped buying them because managing thousands of proxies is a huge effort that we did not want to take on. Also, a lack of data means our Hadoop clusters get thirsty and the machines stop learning properly. Currently we are using a third party, https://proxycrawl.com, on very high tiers (> 10B) with a great discount, and we are happy to have that part solved. Another lesson learnt: things sometimes fail and logs help a lot, so you will need a highly available logging and monitoring system.
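
To illustrate the "things fail and logs help" point, here is a minimal sketch in Go of retrying a flaky fetch while logging every failure. It is not the Erlang/Elixir setup described above; the names Retry and errSimulated and the backoff values are hypothetical, chosen only for this example, and it assumes attempts is at least 1.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errSimulated stands in for a real network or proxy error (hypothetical).
var errSimulated = errors.New("simulated proxy timeout")

// Retry calls op up to attempts times. Each failure is logged, and the
// wait between tries doubles after every failed attempt.
func Retry(attempts int, wait time.Duration, op func() error) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		log.Printf("attempt %d/%d failed: %v", i, attempts, err)
		if i < attempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// The operation fails twice, then succeeds on the third call.
	err := Retry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errSimulated
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

In a real crawler the log lines would go to whatever highly available logging system you run, rather than to standard error.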

Back when I worked for a very large tech company, building their web crawler, I had good success with Golang. On four servers, with 10 GigE interconnect and SSD, and a very fast pipe to the Internet, I was able to push about 10K pages / second sustained. At any given time, there were probably several million connections open concurrently.
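
For anyone curious what the Go approach looks like at toy scale, here is a minimal sketch of a worker-pool fetcher. It is not the crawler described above, just an illustration of bounded concurrency with goroutines and a channel; Crawl, Fetcher, Result, and httpFetcher are hypothetical names, and main runs against a local test server so it needs no Internet access.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// Result records the outcome of fetching one URL.
type Result struct {
	URL   string
	Bytes int
	Err   error
}

// Fetcher downloads a URL and reports the body size in bytes.
type Fetcher func(url string) (int, error)

// httpFetcher returns a Fetcher backed by one shared http.Client with a timeout.
func httpFetcher(timeout time.Duration) Fetcher {
	client := &http.Client{Timeout: timeout}
	return func(url string) (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		defer resp.Body.Close()
		n, err := io.Copy(io.Discard, resp.Body)
		return int(n), err
	}
}

// Crawl fetches every URL using a fixed pool of worker goroutines.
// Results come back in the same order as the input URLs.
func Crawl(urls []string, workers int, fetch Fetcher) []Result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	results := make([]Result, len(urls))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes never overlap.
			for i := range jobs {
				n, err := fetch(urls[i])
				results[i] = Result{URL: urls[i], Bytes: n, Err: err}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Local stand-in for real sites, so the example is self-contained.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()

	urls := []string{srv.URL + "/a", srv.URL + "/b", srv.URL + "/c"}
	for _, r := range Crawl(urls, 2, httpFetcher(5*time.Second)) {
		fmt.Println(r.URL, r.Bytes, r.Err)
	}
}
```

The worker count caps how many requests are in flight at once; a production crawler would also need per-host rate limiting, retries, and a persistent queue, none of which this sketch attempts.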

I've played with Elixir as well, and it's also great for this type of thing.

proxycrawl.com looks very cool, I'm actually looking for a proxy service for my current scraping project. Are they also a good choice if you're doing lower tiers (like thousands of requests a day)?

Golang is a good choice too, but in my experience it's nothing compared to what you can do with Erlang Queue and Elixir. Regarding your question about proxycrawl, I honestly do not know. I tested the service for a few days at a few million requests per day and it was great too. I would say they are good for very high volume; we are still using it, so that should be a good signal to try them.