What technology did you use to build the scraper, and how did you get around the usual challenges (anti-bot measures, IP banning, etc.) of scraping large amounts of data?
The scraper is built in JavaScript with a MongoDB database. Probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So I found a list of stores, built a crawler to go store by store, and standardized the data on my end.
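For anyone curious what "standardized the data" could look like, here is a rough sketch in Python (not the OP's JavaScript code). It flattens one product into one row per variant. The field names (`id`, `title`, `vendor`, `variants`, `sku`, `price`) are what /products.json typically returns, but treat them as assumptions, and `normalize_product` is a made-up helper name.

```python
from typing import Any


def normalize_product(store_url: str, product: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one Shopify-style product dict into one row per variant."""
    rows = []
    for variant in product.get("variants", []):
        raw_price = variant.get("price")
        rows.append({
            "store": store_url.rstrip("/"),
            "product_id": product.get("id"),
            "title": product.get("title"),
            "vendor": product.get("vendor"),
            "variant_id": variant.get("id"),
            "sku": variant.get("sku"),
            # Prices are assumed to arrive as strings such as "19.99".
            "price": float(raw_price) if raw_price else None,
        })
    return rows
```

Rows shaped like this can go straight into a single collection or table, keyed on store plus variant ID, regardless of which store they came from.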
"Does the site have a /products.json file?" is a good first check :) And if it does, "Does that format match the format of a Shopify store?" is another good follow-up question.
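The second question can be approximated with a shape check on the parsed JSON. This is only a heuristic sketch: the set of expected keys (`id`, `title`, `handle`, `variants`) is my assumption about what a Shopify catalog entry carries, and `looks_like_shopify_catalog` is a hypothetical name.

```python
def looks_like_shopify_catalog(payload: object) -> bool:
    """Heuristic: does parsed JSON resemble a Shopify /products.json body?"""
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # Assumed marker fields. Note that an empty catalog passes on shape alone.
    expected = {"id", "title", "handle", "variants"}
    return all(isinstance(p, dict) and expected <= p.keys() for p in products)
```

A positive result only says the body has the right shape, so it works best combined with another signal.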
There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.
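A HEAD probe like that might look as follows, using only the Python standard library. It is a sketch: `head_status` and `probe_url` are hypothetical helper names, and /products.json is the only endpoint assumed here because it is the only one named in this thread.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe_url(base_url: str, path: str = "/products.json") -> str:
    """Join a store's base URL and the endpoint path to probe."""
    return base_url.rstrip("/") + path


def head_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a HEAD request; return the HTTP status, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 and other error statuses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, etc.
```

Something like `head_status(probe_url("https://some-store.example"))` returning 200 rather than 404 is then the signal to look closer.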
Or an even better approach I've used in the past (to check in bulk which competitor's platform a list of prospects is using) is just to use DNS: a Shopify shop will be CNAMEd to a certain Shopify hostname.
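A minimal sketch of the DNS approach, with assumptions flagged: the comment above does not name the target hostname, so the `myshopify.com` suffix is my assumption, and the helper names are made up. The standard library has no direct CNAME query, so this leans on `socket.gethostbyname_ex`, whose first return value is the canonical name after aliases are followed (resolver behavior can vary by platform).

```python
import socket
from typing import Optional

# Assumption: custom-domain shops alias to a host under this suffix.
SHOPIFY_SUFFIX = "myshopify.com"


def is_shopify_host(canonical: str) -> bool:
    """True if a canonical DNS name is, or sits under, the assumed suffix."""
    name = canonical.rstrip(".").lower()
    return name == SHOPIFY_SUFFIX or name.endswith("." + SHOPIFY_SUFFIX)


def canonical_name(hostname: str) -> Optional[str]:
    """Resolve a hostname and return its canonical name, or None on failure."""
    try:
        return socket.gethostbyname_ex(hostname)[0]
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def is_shopify_domain(hostname: str) -> bool:
    """Combine the lookup and the suffix check for one hostname."""
    name = canonical_name(hostname)
    return name is not None and is_shopify_host(name)
```

Since CNAMEs normally live on a subdomain rather than the bare domain, checking the `www.` name is probably the more reliable variant. A dedicated DNS library would give you the raw CNAME record if the canonical-name shortcut turns out to be too loose.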
> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". I think they are used for building sales outreach lists. Then narrowed the focus to US-only stores with between $100k and $1m in revenue to keep the initial data set manageable (and the CPU / storage costs reasonable).
(not the OP, but I have some experience with Shopify)
Shopify stores publish their product catalog at /products.json. From personal experience, you can hammer it pretty hard without being rate limited.
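To walk a whole catalog you would page through that endpoint. The sketch below assumes the endpoint accepts `limit` and `page` query parameters and that an empty `products` list marks the end; neither is stated in this thread, so verify against a real store. `iter_products` and `fetch_json` are hypothetical names, and the small delay is just politeness even if rate limits are lenient.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Iterator


def fetch_json(url: str) -> dict:
    """Fetch a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def iter_products(
    base_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    limit: int = 250,        # assumed page-size parameter
    max_pages: int = 100,    # safety cap so a misbehaving site cannot loop forever
    delay: float = 0.5,      # pause between pages, in seconds
) -> Iterator[dict[str, Any]]:
    """Yield products page by page until an empty page comes back."""
    for page in range(1, max_pages + 1):
        url = f"{base_url.rstrip('/')}/products.json?limit={limit}&page={page}"
        products = fetch(url).get("products", [])
        if not products:
            return
        yield from products
        time.sleep(delay)
```

Passing `fetch` in as a parameter makes it easy to swap in a fake for testing or a version that adds retries.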
A challenge is that the pricing info in that endpoint is based on the stock Shopify catalog fields, and can be misleading depending on the specific theme customizations that the merchant uses.
Here's an example: https://www.wildfox.com/products.json
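To make the pricing caveat concrete, here is a sketch that reads only the stock catalog fields per variant. It assumes variants carry `price` and `compare_at_price` as strings (or null); `variant_prices` is a made-up helper. Whatever it reports is what the catalog says, which may differ from what a customized theme actually shows shoppers.

```python
from decimal import Decimal
from typing import Any, Optional


def _to_decimal(value: Optional[str]) -> Optional[Decimal]:
    """Convert a price string such as "19.99" to Decimal; pass None through."""
    return Decimal(value) if value else None


def variant_prices(product: dict[str, Any]) -> list[dict[str, Any]]:
    """Report the stock catalog price fields for each variant of a product."""
    out = []
    for variant in product.get("variants", []):
        price = _to_decimal(variant.get("price"))
        compare = _to_decimal(variant.get("compare_at_price"))
        out.append({
            "variant_id": variant.get("id"),
            "price": price,
            "compare_at_price": compare,
            # "Discounted" only in the catalog's own terms, not the theme's.
            "discounted": price is not None and compare is not None and compare > price,
        })
    return out
```

Spot-checking a few of these against the rendered product pages is a cheap way to find out whether a given store's theme changes the displayed price.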