| Unfortunately no, but I can share some insights that I hope will be of value:
- Tech: Everything is hosted on AWS. The scraping is done by Go programs running in Docker containers. They run on ECS Fargate Spot when needed, triggered by a cron job. The scraping results are stored as Parquet files in S3 and processed in our RDS PostgreSQL database. We need to be creative and have some methods to identify that a particular product A in store 1 is the same as product A in store 2, so that they are mapped together. Sometimes this needs to be verified manually. The data that is of interest to the user/site is indexed into Elasticsearch.
Things that might be of interest:
- We always try to avoid parsing the HTML and instead call the sites' APIs directly to reduce scraping time. We also try to scrape the category listings so that one request returns multiple prices; this can reduce the total from over 100 000 requests to maybe fewer than 1000.
- We also try to avoid scraping the sites during peak times, and we respect their robots.txt. We add some delay to each request. The scrapes are often done at night or in the early morning.
- The main challenge is that stores can redesign or modify their sites, which makes our scrapers fail, so we need to be fast and adapt to the changes.
- Another major hidden challenge is that stores have different prices for the same product depending on your zip code, so we have our ways of identifying a store's different warehouses and which zip codes belong to a specific warehouse, and then we do a scrape for that warehouse. A store might have 5 warehouses, so we need to scrape it 5 times with different zip codes.
There is much more, but I hope that gave you some insight into the challenges and some solutions! |
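To make the warehouse and category-listing points above a bit more concrete, here is a minimal Go sketch. It is not the actual scraper: the zip codes, warehouse names, categories, and the helpers `representativeZips`, `planJobs`, and `runJobs` are all hypothetical, and the `fetch` callback stands in for whatever API call a given store needs. It only shows the idea of picking one zip code per warehouse, planning one request per warehouse and category, and pausing between requests.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// ScrapeJob is one category-listing request for one warehouse.
type ScrapeJob struct {
	Warehouse string
	Zip       string
	Category  string
}

// representativeZips picks one zip code per warehouse so that each
// warehouse is scraped once instead of once per zip code.
// The zip-to-warehouse mapping is assumed to be known already.
func representativeZips(zipToWarehouse map[string]string) map[string]string {
	rep := make(map[string]string)
	for zip, wh := range zipToWarehouse {
		if cur, ok := rep[wh]; !ok || zip < cur {
			rep[wh] = zip // keep the smallest zip for a deterministic result
		}
	}
	return rep
}

// planJobs builds one job per (warehouse, category) pair,
// ordered by warehouse name and then by the given category order.
func planJobs(zipToWarehouse map[string]string, categories []string) []ScrapeJob {
	rep := representativeZips(zipToWarehouse)
	warehouses := make([]string, 0, len(rep))
	for wh := range rep {
		warehouses = append(warehouses, wh)
	}
	sort.Strings(warehouses)

	jobs := make([]ScrapeJob, 0, len(warehouses)*len(categories))
	for _, wh := range warehouses {
		for _, c := range categories {
			jobs = append(jobs, ScrapeJob{Warehouse: wh, Zip: rep[wh], Category: c})
		}
	}
	return jobs
}

// runJobs calls fetch for each job in order, sleeping for delay between
// requests, and stops at the first error.
func runJobs(jobs []ScrapeJob, delay time.Duration, fetch func(ScrapeJob) error) error {
	for i, job := range jobs {
		if i > 0 {
			time.Sleep(delay)
		}
		if err := fetch(job); err != nil {
			return fmt.Errorf("warehouse %s, category %s: %w", job.Warehouse, job.Category, err)
		}
	}
	return nil
}

func main() {
	// Hypothetical data: five zip codes served by two warehouses.
	zips := map[string]string{
		"11120": "north", "11121": "north",
		"41101": "west", "41102": "west", "41103": "west",
	}
	jobs := planJobs(zips, []string{"dairy", "bakery"})

	// The fetch callback only prints here; a real one would call the store's API.
	err := runJobs(jobs, 10*time.Millisecond, func(j ScrapeJob) error {
		fmt.Printf("category=%s zip=%s warehouse=%s\n", j.Category, j.Zip, j.Warehouse)
		return nil
	})
	if err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```

With this shape, the request count grows with warehouses times categories rather than with zip codes times products, which is where the large reduction described above would come from.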
Do you run into issues where they block your scraping attempts or are they quite relaxed on this? Circumventing the bot detection often forces us to go for Puppeteer so we can fully control the browser, but that carries quite a heavy cost compared to using a simple HTTP requester.