by grafraf 684 days ago

Unfortunately no, but I can share some insights that I hope can be of value:

- Tech: Everything is hosted in AWS. The scraping is done by Golang programs running in Docker containers. They run on ECS Fargate Spot when needed, triggered by a cron job. The scraping result is stored as Parquet in S3 and processed in our RDS PostgreSQL. We need to be creative and have some methods to identify that a particular product A in store 1 is the same as product A in store 2, so that they are mapped together. Sometimes this needs to be verified manually. The data that is of interest for the user/site is indexed into Elasticsearch.

Things that might be of interest:

- We always try to avoid parsing the HTML and instead call the sites' APIs directly to reduce scraping time. We also try to scrape the category listings to get multiple prices in one request; this can reduce the total from over 100,000 requests to maybe fewer than 1,000.

- We also try to avoid scraping the sites during peak times and respect their robots.txt. We add some delay to each request. The scrapes are often done at night or in the early morning.

- The main challenge is that stores can redesign or modify their sites, which makes our scrapers fail, so we need to be fast and adapt to the changes.

- Another major hidden challenge is that stores have different prices for the same product depending on your zip code, so we have our ways of identifying a store's different warehouses and which zip codes belong to a specific warehouse, and then we do a scrape for each warehouse. A store might have 5 warehouses, so we need to scrape it 5 times with different zip codes.
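To make the warehouse point concrete, here is a minimal Go sketch of picking one zip code per warehouse, so a store with 5 warehouses is scraped 5 times rather than once per zip code. The `zipToWarehouse` table, the function name, and the sample zip codes and warehouse names are all made up for illustration; how the real mapping is discovered is not described here.

```go
package main

import (
	"fmt"
	"sort"
)

// representativeZips picks one zip code per warehouse so that each
// warehouse is scraped exactly once. zipToWarehouse is a hypothetical
// lookup table (zip code -> warehouse ID).
func representativeZips(zipToWarehouse map[string]string) map[string]string {
	reps := make(map[string]string)
	for zip, wh := range zipToWarehouse {
		// Keep the lexicographically smallest zip so the result is
		// deterministic despite Go's randomized map iteration order.
		if cur, ok := reps[wh]; !ok || zip < cur {
			reps[wh] = zip
		}
	}
	return reps
}

func main() {
	// Made-up sample data, for illustration only.
	zips := map[string]string{
		"11120": "warehouse-a",
		"11359": "warehouse-a",
		"41103": "warehouse-b",
		"21119": "warehouse-c",
	}
	reps := representativeZips(zips)

	// Sort the warehouse IDs so the output order is stable.
	warehouses := make([]string, 0, len(reps))
	for wh := range reps {
		warehouses = append(warehouses, wh)
	}
	sort.Strings(warehouses)

	for _, wh := range warehouses {
		fmt.Printf("scrape %s using zip %s\n", wh, reps[wh])
	}
}
```

With this shape, the number of scrapes per store equals the number of distinct warehouses, no matter how many zip codes the store serves.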

There is much more, but I hope that gave you some insight into the challenges and some solutions!
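The request pacing mentioned above (a delay on each request, and scraping outside peak hours) could be sketched roughly like this in Go. The function names, the 01:00 to 06:00 window, the 500 ms delay, and the example.com URLs are assumptions for illustration, not the actual setup; robots.txt handling is not covered.

```go
package main

import (
	"fmt"
	"time"
)

// inWindow reports whether hour falls inside [startHour, endHour),
// including windows that wrap past midnight (e.g. 23 to 6).
func inWindow(hour, startHour, endHour int) bool {
	if startHour <= endHour {
		return hour >= startHour && hour < endHour
	}
	return hour >= startHour || hour < endHour
}

// fetchAll calls fetch for each URL in order, sleeping for delay
// between consecutive requests. It stops at the first error.
func fetchAll(urls []string, delay time.Duration, fetch func(string) error) error {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(delay)
		}
		if err := fetch(u); err != nil {
			return fmt.Errorf("fetch %s: %w", u, err)
		}
	}
	return nil
}

func main() {
	// Assumed off-peak window: 01:00 to 06:00 local time.
	if !inWindow(time.Now().Hour(), 1, 6) {
		fmt.Println("outside the off-peak window, skipping this run")
		return
	}

	// Placeholder URLs; a real run would list category endpoints.
	urls := []string{
		"https://example.com/api/category/1",
		"https://example.com/api/category/2",
	}
	err := fetchAll(urls, 500*time.Millisecond, func(u string) error {
		fmt.Println("would request", u) // stand-in for a real HTTP call
		return nil
	})
	if err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```

Passing the fetch function in keeps the pacing logic separate from the HTTP client, so it can be exercised without touching the network.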

3 comments

showsover 684 days ago

Interesting stuff, thanks for the reply!

Do you run into issues where they block your scraping attempts or are they quite relaxed on this? Circumventing the bot detection often forces us to go for Puppeteer so we can fully control the browser, but that carries quite a heavy cost compared to using a simple HTTP requester.

grafraf 684 days ago

We have been blocked a couple of times over the years; usually using a proxy has been enough. We try to reach out to the stores and establish a friendly relationship. The feelings have been mixed, depending on which store we are talking to.

ElCapitanMarkla 684 days ago

I'm unfamiliar with the parquet format and trying to understand - are you storing the raw scraped data in that format or are you storing the result of parsing the scraped data?

grafraf 683 days ago

We are storing the result of the parsed scrape as Parquet. I would advise storing the raw data as well, in a different S3 bucket. The database should only hold the data it needs and not act as storage.

sumedh 684 days ago

Have the sites tried to shut you down?

grafraf 684 days ago

We received some harsh words at the start, but everything we are doing is legal and by the book.

We try to establish a good relationship with the stores, as customers don't always focus on price; sometimes they want a specific product. We are helping both the stores and the customers find each other. We have sent millions of users to the stores over the years (not unique, of course, as there are only 9 million people living in Sweden).
