

	by pencildiver 961 days ago
	The scraper is built in JavaScript with a Mongo database. Probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So I found a list of stores, built a crawler to go store-by-store, and standardized the data on my end. Here's an example: https://www.wildfox.com/products.json
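
A rough sketch of the fetch-and-standardize step described above, written in Python rather than the JavaScript the scraper actually uses. The helper names (`products_url`, `fetch_products`, `normalize`) are hypothetical, and the field names (`products`, `variants`, `id`, `title`, `vendor`, `price`) are assumptions about what the public payload typically contains, not something confirmed in the post.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def products_url(base_url: str) -> str:
    """Build [Base URL]/products.json from a store's base URL."""
    return urljoin(base_url.rstrip("/") + "/", "products.json")


def fetch_products(base_url: str, timeout: float = 10.0) -> dict:
    """Download and parse a store's products.json (makes a network call)."""
    req = Request(products_url(base_url), headers={"User-Agent": "example-crawler"})
    with urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def normalize(store: str, payload: dict) -> list[dict]:
    """Flatten one payload into one record per product variant.

    Field names are assumed; missing fields become None instead of raising.
    """
    rows = []
    for product in payload.get("products", []):
        for variant in product.get("variants", []):
            price = variant.get("price")
            rows.append({
                "store": store,
                "product_id": product.get("id"),
                "title": product.get("title"),
                "vendor": product.get("vendor"),
                "variant_id": variant.get("id"),
                # Prices are assumed to arrive as strings such as "12.50".
                "price": float(price) if price is not None else None,
            })
    return rows
```

A crawler would call `normalize(base_url, fetch_products(base_url))` for each store and write the resulting rows to the database.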

6 comments

bomewish 961 days ago

What’s the trade-off of using JS for this? Would it have been much faster to use Go or something?

fermisea 961 days ago

Oh nice, you deserve great things in life for this comment!

satvikpendem 961 days ago

How did you detect that it was a Shopify store?

capableweb 961 days ago

Not OP but:

"Does the site have a /products.json file?" is a good first check :) And if it does, "Does that format match the format of a Shopify store?" is another good follow-up question.
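
The second check could look something like this. It is only a sketch: the function name `looks_like_shopify_products` is made up, and the set of expected keys (`id`, `title`, `handle`, `variants`) is an assumption about the payload shape, so treat a match as a hint rather than proof.

```python
def looks_like_shopify_products(payload: object) -> bool:
    """Heuristic: does a parsed /products.json body have the expected shape?"""
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # Assumed per-product keys; an empty product list still passes the shape check.
    expected = {"id", "title", "handle", "variants"}
    return all(isinstance(p, dict) and expected <= p.keys() for p in products)
```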

xp84 961 days ago

There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.

Or an even better way, which I’ve used in the past to check in bulk which competitor’s platform a list of prospects is using, is just to use DNS: a Shopify shop will be CNAMEd to a certain Shopify hostname.
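
Both checks can be sketched with the Python standard library. Everything named here is illustrative: `head_status`, `is_shopify_hostname`, and `resolves_to_shopify` are hypothetical helpers, and the `myshopify.com` suffix is an assumption about which hostname is meant, since the comment does not name it. `socket.gethostbyname_ex` reports a canonical name and aliases, which usually reflects a CNAME chain but depends on the system resolver, so a dedicated DNS library would be more reliable for bulk work.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumed suffix for Shopify-hosted shops; not stated in the thread.
SHOPIFY_SUFFIX = "myshopify.com"


def head_status(url: str, timeout: float = 10.0):
    """Send a HEAD request; return the status code, or None if unreachable."""
    req = Request(url, method="HEAD", headers={"User-Agent": "example-probe"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # e.g. 404 for a missing endpoint
    except (URLError, OSError):
        return None


def is_shopify_hostname(name: str) -> bool:
    """True if a DNS name is the assumed Shopify suffix or a subdomain of it."""
    name = name.rstrip(".").lower()
    return name == SHOPIFY_SUFFIX or name.endswith("." + SHOPIFY_SUFFIX)


def resolves_to_shopify(host: str, resolver=socket.gethostbyname_ex) -> bool:
    """Check the canonical name and aliases reported for a host."""
    try:
        canonical, aliases, _addresses = resolver(host)
    except OSError:
        return False
    return any(is_shopify_hostname(n) for n in (canonical, *aliases))
```

The `resolver` parameter exists only so the lookup can be swapped out, for example with a stub when checking a list offline.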

8372049 961 days ago

In another comment, OP wrote:

> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed the focus to US-only stores with between $100k and $1m in revenue to keep the initial data set manageable (and the CPU / storage costs reasonable).

satvikpendem 961 days ago

Ah, that makes more sense, I used BuiltWith before.

stef25 960 days ago

Looked in the source of a random Shopify store: there are 200+ occurrences of "shopify", so that's a clue :)

qdequelen 960 days ago

Did you only get the schema.json?

awill88 961 days ago

Excellent work

thomasfromcdnjs 961 days ago

ooo that is a hot tip!
