by pencildiver 914 days ago
The scraper is built in JavaScript with a MongoDB database. It's probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So I found a list of stores, built a crawler to go store by store, and standardized the data on my end.

Here's an example: https://www.wildfox.com/products.json
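
Not OP's actual code, but a minimal sketch of what the "go store by store and standardize" step could look like in JavaScript. The function names (standardizeProducts, crawlStores) are hypothetical, and the field names (title, handle, vendor, variants[].price) are assumptions about what the payload contains. It also assumes a runtime with a global fetch (Node 18+) and leaves out pagination and the database write.

```javascript
// Sketch only: the field names used here are assumptions about the payload
// shape, not a documented contract. Function names are made up.

// Turn one store's parsed products.json payload into flat, uniform records.
function standardizeProducts(storeUrl, payload) {
  const products = Array.isArray(payload && payload.products)
    ? payload.products
    : [];
  return products.map((p) => {
    // Prices are assumed to arrive as strings such as "10.00".
    const prices = (p.variants || [])
      .map((v) => Number.parseFloat(v.price))
      .filter((n) => Number.isFinite(n));
    return {
      store: storeUrl,
      id: p.id,
      title: p.title || "",
      handle: p.handle || "",
      vendor: p.vendor || "",
      minPrice: prices.length ? Math.min(...prices) : null,
      maxPrice: prices.length ? Math.max(...prices) : null,
    };
  });
}

// Fetch [Base URL]/products.json for each store in turn, skipping failures.
async function crawlStores(storeUrls) {
  const records = [];
  for (const storeUrl of storeUrls) {
    try {
      const res = await fetch(new URL("/products.json", storeUrl));
      if (!res.ok) continue; // no public file, or the store blocked us
      records.push(...standardizeProducts(storeUrl, await res.json()));
    } catch (err) {
      console.error(`skipping ${storeUrl}: ${err.message}`);
    }
  }
  return records;
}
```

Keeping the normalization separate from the fetching means the same function can be run against saved payloads, and the returned records could be inserted into Mongo (or anything else) as-is.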

6 comments

What’s the trade-off of using JS for this? Would it have been much faster to use Go or something?
Oh nice, you deserve great things in life for this comment!
How did you detect that it was a Shopify store?
Not OP but:

"Does the site have a /products.json file?" is a good first check :) And if it does, "Does the format match that of a Shopify store?" is a good follow-up question.

There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.
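
A rough sketch of those two checks in JavaScript, in case it helps: a HEAD request for 200 vs. anything else, then a look at whether the body has the expected shape. The function names are made up, and the shape test (a top-level products array whose items have a handle and a variants array) is my assumption about what the format looks like, so treat it as a heuristic. Assumes a global fetch (Node 18+).

```javascript
// Heuristic only: the expected shape (products[] with handle and variants[])
// is an assumption, and an empty catalogue is treated as inconclusive.
function looksLikeShopifyProducts(payload) {
  if (!payload || !Array.isArray(payload.products)) return false;
  if (payload.products.length === 0) return false; // nothing to inspect
  const first = payload.products[0];
  return (
    typeof first === "object" &&
    first !== null &&
    "handle" in first &&
    Array.isArray(first.variants)
  );
}

// Cheap existence check: HEAD the URL and look for a 200.
async function respondsOk(url) {
  const res = await fetch(url, { method: "HEAD" });
  return res.status === 200;
}

// Combine both checks; any network or parse error counts as "no".
async function isProbablyShopify(baseUrl) {
  const url = new URL("/products.json", baseUrl);
  try {
    if (!(await respondsOk(url))) return false;
    const res = await fetch(url);
    return res.ok && looksLikeShopifyProducts(await res.json());
  } catch {
    return false;
  }
}
```

The HEAD request goes first because it is cheap; the body is only downloaded for sites that pass it.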

Or an even better way, which I’ve used in the past to check in bulk which competitor’s platform a list of prospects is using, is just to use DNS: a Shopify shop will be CNAMEd to a certain Shopify hostname.
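
For anyone who wants to try the DNS route, here is a small sketch using Node's built-in resolver. The function names are hypothetical, and the suffix to match against is left as a parameter because which hostname to look for is something you'd need to confirm yourself; "myshopify.com" in the usage line is only my assumption. Note that this only works for hostnames that actually have a CNAME record (typically the www subdomain), since bare apex domains usually use A records instead.

```javascript
// Pure check: does any CNAME target equal the suffix or sit underneath it?
function pointsAtPlatform(cnames, platformSuffix) {
  const suffix = platformSuffix.toLowerCase().replace(/\.$/, "");
  return cnames.some((name) => {
    const host = name.toLowerCase().replace(/\.$/, "");
    return host === suffix || host.endsWith("." + suffix);
  });
}

// Look up the CNAME records for a hostname and test them against the suffix.
// Lookup failures (no CNAME record, NXDOMAIN, timeouts) are treated as "no".
async function isCnamedTo(hostname, platformSuffix) {
  const { resolveCname } = await import("node:dns/promises");
  try {
    const cnames = await resolveCname(hostname);
    return pointsAtPlatform(cnames, platformSuffix);
  } catch {
    return false;
  }
}

// Usage (suffix is an assumption, not a confirmed value):
// isCnamedTo("www.example.com", "myshopify.com").then(console.log);
```

Matching on a dot-delimited suffix rather than a plain substring avoids false hits on unrelated domains that merely contain the same text.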

In another comment, OP wrote:

> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed the focus to US-only stores with between $100k and $1m in revenue to keep the initial data set manageable (and the CPU / storage costs reasonable).

Ah, that makes more sense. I've used BuiltWith before.
I looked at the source of a random Shopify store and there are 200+ occurrences of "shopify", so that's a clue :)
Did you only get the schema.json?
Excellent work
ooo that is a hot tip!