Hacker News new | ask | show | jobs
by grafraf 681 days ago
We have been doing it for the Swedish market in more than 8 years. We have a website https://www.matspar.se/ , where the customer can browse all the products of all major online stores, compare the prices and add the products they want to buy in the cart. The customer can in the end of the journey compare the total price of that cart (including shipping fee) and export the cart to the store they desire to order it.

I'm also one of the founders and the current CTO, so there been a lot of scraping and maintaining during the years. We are scraping over 30 million prices daily.

2 comments

On the business side, what's your business model, how do you generate revenue? What's the longer term goals?

(Public data shows the company have a revenue of ≈400k USD and 6 employees https://www.allabolag.se/5590076351/matspar-i-sverige-ab)

We are selling price/distribution data about the products we scrape. We do run some ads and have an affiliate deals.

The insight i can share is that the main (tech) goal is to make the product more user friendly and more aligned with the customer need as it has many pain points and we have gain some insights on the preferred customer journey.

Do you have a technical writeup of your scraping approach? I'd love to read more about the challenges and solutions for them.
Unfortunately no, but i can share some insights that i hope can be of value:

- Tech: Everything is hosted in AWS. We are using Golang in docker containers that does the scraping. They run on ECS Fargate spots when needed using cronjob. The scraping result is stored as a parquet in S3 and processed in our RDS Postgresql. We need to be creative and have some methods to identify that a particular product A in store 1 is the same as product A in store 2 so they are mapped together. Sometimes it needs to be verified manually. The data that are of interest for the user/site is indexed into an Elastic search.

Things that might be of interest: - We always try to avoid parsing the HTML but instead calling the sites APIs directly to reduce scraping time. We also try to scrape the category listing to access multiple prices by one request, this can reduce the total requests from over 100 000 to maybe less than 1000 requests.

- We also try to avoid scraping the sites during peak times and respect their robots.txt. We add some delay to each request. The scrapes are often done during night/early morning.

- The main challenge is that stores can redesign or modify which make our scrapers fail, so we need to be fast and adopt to the new changes.

- Another major hidden challenge is that the stores have different prices for the same product depending on your zip code, so we have our ways of identifying the stores different warehouses, what zip codes belong to a specific warehouse and do a scrape for that warehouse. So a store might have 5 warehouses, so we need to scrape it 5 times with different zip codes

There is much more but i hope that gave you some insights of challenges and some solutions!

Interesting stuff, thanks for the reply!

Do you run into issues where they block your scraping attempts or are they quite relaxed on this? Circumventing the bot detection often forces us to go for Puppeteer so we can fully control the browser, but that carries quite a heavy cost compared to using a simple HTTP requester.

We have been blocked a couple of times during they years, usually using proxy has been enough. We try to reach out to the stores and try to establish a friendly relationship. The feelings have been mixed depending on what store we are talking to
I'm unfamiliar with the parquet format and trying to understand - are you storing the raw scraped data in that format or are you storing the result of parsing the scraped data?
We are storing the result of the parsed scrape as parquet. I would advice to store the raw data as well in a different s3. The database should only have the data it needs and not act as a storage.
Have the sites tried to shut you down?
We received some harsh words in the start but everything we are doing is legally and by the book.

We try to establish good relationship with the stores as the customers don't always focus on the price, but sometimes they want a specific product. We are both helping the stores and the customers to find each other. We have sent million of users over the years to the stores (not unique of course as there are only 9 million people living in Sweden)