by rosha
2896 days ago
I tried several different queue systems. The best version I got used Erlang Queue, Elixir, and Kafka on top to run a highly concurrent crawler. The project was to develop a real-time Amazon product (ASIN) price monitoring system for our company as a challenger prototype. Our main problem was proxies: we stopped buying them because managing thousands of proxies is a huge effort that we did not want to take on, and a lack of data means our Hadoop clusters go thirsty and the machines stop learning properly. Currently we are using a third party, https://proxycrawl.com, on a very high tier (> 10B) with a great discount, and we are happy to have that part solved. Another lesson learnt is that things sometimes fail and logs help a lot, so you will need a highly available logging and monitoring system.
I've played with Elixir as well, and it's also great for this type of thing.
proxycrawl.com looks very cool, I'm actually looking for a proxy service for my current scraping project. Are they also a good choice if you're doing lower tiers (like thousands of requests a day)?