| HN Mirror



	by misterbwong 961 days ago
	What technology did you use to build the scraper, and how did you get around the usual challenges of scraping large amounts of data (anti-bot measures, IP banning, etc.)?


pencildiver 961 days ago

The scraper is built in JavaScript with a MongoDB database. Probably not the most scalable way to do it, but I found that all Shopify stores have a public JSON file available at [Base URL]/products.json. So I found a list of stores, built a crawler to go store by store, and standardized the data on my end.

Here's an example: https://www.wildfox.com/products.json
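A minimal sketch of the crawl-and-standardize step described above, written in Python rather than the JavaScript the OP used. The field names (`products`, `variants`, `price`, `vendor`) and the `limit`/`page` query parameters are assumptions about the public payload, and `products_url`, `standardize`, and `fetch_page` are hypothetical helper names, not the OP's code.

```python
import json
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def products_url(base_url, page=1, limit=250):
    """Build the catalog URL for one page of a store (query params are assumed)."""
    root = base_url.rstrip("/") + "/"
    return urljoin(root, f"products.json?limit={limit}&page={page}")


def standardize(store, product):
    """Flatten one raw product dict into a simple record, one row per product."""
    variants = product.get("variants") or []
    # Prices are assumed to arrive as strings such as "19.99".
    prices = [float(v["price"]) for v in variants if v.get("price")]
    return {
        "store": store,
        "product_id": product.get("id"),
        "title": product.get("title"),
        "vendor": product.get("vendor"),
        "min_price": min(prices) if prices else None,
    }


def fetch_page(base_url, page=1):
    """Download and decode one catalog page. This performs a real HTTP request."""
    req = Request(products_url(base_url, page),
                  headers={"User-Agent": "example-crawler"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp).get("products", [])
```

Going store by store would then be a loop over a list of base URLs, calling `fetch_page` until a page comes back empty and writing each `standardize(...)` record to the database.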


bomewish 960 days ago

What's the trade-off of using JS for this? Would it have been much faster to use Go or something?


fermisea 961 days ago

Oh nice, you deserve great things in life for this comment!


satvikpendem 961 days ago

How did you detect that it was a Shopify store?


capableweb 961 days ago

Not OP but:

"Does the site have a /products.json file?" is a good first check :) And if it does, "Does the format match that of a Shopify store?" is another good follow-up question.
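The second check could be sketched roughly like this, in Python. The set of expected keys is an assumption about what a Shopify catalog payload contains, and `looks_like_shopify_products` is a hypothetical helper name.

```python
def looks_like_shopify_products(payload):
    """Return True if decoded JSON has the shape assumed for a Shopify catalog."""
    if not isinstance(payload, dict):
        return False
    products = payload.get("products")
    if not isinstance(products, list):
        return False
    # Keys assumed to be present on every product in the public payload.
    expected = {"id", "title", "handle", "variants"}
    # Note: an empty product list passes, since there is nothing to contradict it.
    return all(isinstance(p, dict) and expected.issubset(p.keys())
               for p in products)
```

A stricter version might also require at least one product, or check variant fields, at the cost of rejecting stores with an empty catalog.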


xp84 961 days ago

There are lots of telltale endpoints that you could just HEAD for a 200 vs 404. Or even just the products.json itself is a pretty good giveaway.

Or an even better way, which I've used in the past to check in bulk which competitor's platform a list of prospects is using, is just to use DNS: a Shopify shop will be CNAMEd to a certain Shopify hostname.
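Both checks can be sketched with the Python standard library. The `myshopify.com` suffix is an assumption about which hostname Shopify shops point at (the comment does not name it), and `head_status`, `canonical_name`, and `is_shopify_cname` are hypothetical helper names.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: Shopify-hosted shops resolve to a name under this domain.
SHOPIFY_SUFFIX = "myshopify.com"


def head_status(url, timeout=10):
    """Send a HEAD request; return the status code, or None on a URLError."""
    req = Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # 4xx/5xx responses, e.g. 404
        return err.code
    except URLError:  # DNS failure, refused connection, etc.
        return None


def canonical_name(hostname):
    """Return the canonical DNS name the system resolver reports for hostname."""
    return socket.gethostbyname_ex(hostname)[0]


def is_shopify_cname(target):
    """Check whether a resolved name falls under the assumed Shopify domain."""
    name = target.rstrip(".").lower()
    return name == SHOPIFY_SUFFIX or name.endswith("." + SHOPIFY_SUFFIX)
```

For a bulk list, something like `is_shopify_cname(canonical_name(host))` avoids an HTTP request per prospect, which is presumably why the DNS route scales better.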


8372049 961 days ago

In another comment, OP wrote:

> Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". Think they are used for building sales outreach lists. Then narrowed the focus to US-only stores with between $100k and $1m in revenue to keep the initial data set manageable (and the CPU / storage costs reasonable).
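The narrowing step in that quote amounts to a simple filter. A rough Python sketch, assuming each purchased record carries hypothetical `country` and `revenue_usd` fields (the real export format is not described in the thread):

```python
def in_scope(store, country="US", min_revenue=100_000, max_revenue=1_000_000):
    """Keep stores in the given country whose revenue is within the range."""
    revenue = store.get("revenue_usd")  # hypothetical field name
    if revenue is None:
        return False  # unknown revenue: leave it out of the initial set
    return (store.get("country") == country
            and min_revenue <= revenue <= max_revenue)


# Made-up sample records, only to show the filter in use.
stores = [
    {"domain": "a.example", "country": "US", "revenue_usd": 250_000},
    {"domain": "b.example", "country": "DE", "revenue_usd": 250_000},
    {"domain": "c.example", "country": "US", "revenue_usd": 5_000_000},
]
selected = [s["domain"] for s in stores if in_scope(s)]
```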


satvikpendem 961 days ago

Ah, that makes more sense, I used BuiltWith before.


stef25 960 days ago

Looked in the source of a random Shopify store: there are 200+ occurrences of "shopify". That's a clue :)


qdequelen 960 days ago

Did you only get the schema.json?


awill88 961 days ago

Excellent work


thomasfromcdnjs 961 days ago

ooo that is a hot tip!


cldellow 961 days ago

(not the OP, but I have some experience with Shopify)

Shopify stores publish their product catalog at /products.json. From personal experience, you can hammer it pretty hard without being rate limited.

A challenge is that the pricing info in that endpoint is based on the stock Shopify catalog fields, and can be misleading depending on the specific theme customizations that the merchant uses.
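To make the caveat concrete, here is a rough Python sketch of reading prices from the stock catalog fields. It assumes each variant carries string-valued `price` and `compare_at_price` fields; `price_summary` is a hypothetical helper, and whatever a customized theme actually displays may differ from these numbers.

```python
from decimal import Decimal


def price_summary(product):
    """Summarize catalog prices for one product; None if no variant has a price."""
    variants = [v for v in (product.get("variants") or []) if v.get("price")]
    if not variants:
        return None
    prices = [Decimal(v["price"]) for v in variants]
    # A variant counts as discounted when its compare-at price exceeds its price.
    on_sale = any(
        v.get("compare_at_price")
        and Decimal(v["compare_at_price"]) > Decimal(v["price"])
        for v in variants
    )
    return {"min": min(prices), "max": max(prices), "on_sale": on_sale}
```

Storing the minimum and maximum rather than a single price at least keeps the spread across variants visible, but it still reflects the catalog, not the storefront.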
