| HN Mirror


	by Asparagirl 914 days ago
	Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?

pencildiver 914 days ago

Bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". I think their lists are mostly used for building sales outreach lists. Then narrowed down the focus to US stores only, and between $100k - $1m in revenue, to keep the initial data set manageable (and the CPU / storage costs reasonable).

russum 912 days ago

> and between $100k - $1m in revenue

Does "Built With" provide that data? How accurate do you think it might be?

xnx 914 days ago

https://www.shopify.com/robots.txt lists a lot of sitemap files, which tend to be a good starting point.

prayze 914 days ago

Did this suddenly get changed? Nothing but "# ,: # ,' | # / : # --' / # \/ />/ # /" is shown now.

wizzwizz4 914 days ago

It's just your browser's HTML parser. Line 6:

  #                         / <//_\

This is being interpreted as a malformed HTML closing tag, which (according to the HTML5 parsing algorithm published by WHATWG) gets treated as a comment. The file doesn't contain any > past this point. This leaves the uncommented contents from lines 1–6:

  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /

Or, with whitespace collapsed:

  # ,: # ,' | # / : # --' / # \/ />/ # /

Which should be exactly what you observe.

Ref: https://html.spec.whatwg.org/multipage/parsing.html https://developer.mozilla.org/en-US/docs/Web/CSS/white-space...
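
The behaviour described above can be roughly reproduced in a few lines of Python. This is only a sketch under simplifying assumptions: it handles just the first "</" that is not followed by an ASCII letter (the bogus-comment case), treats everything up to the next ">" or the end of input as the comment, and then collapses whitespace the way default CSS rendering would. The names ART and visible_text are made up for illustration, and ART holds only the six lines quoted above, not the whole file.

```python
import re

# The six lines quoted above; the last one contains the "</" sequence.
ART = "\n".join([
    "#                               ,:",
    "#                             ,' |",
    "#                            /   :",
    "#                         --'   /",
    "#                         \\/ />/",
    "#                         / <//_\\",
])


def visible_text(source: str) -> str:
    """Approximate what a browser displays when this text is parsed as HTML."""
    # "</" followed by anything other than an ASCII letter starts a bogus
    # comment, which runs to the next ">" or to the end of the input.
    match = re.search(r"</(?![A-Za-z])", source)
    if match:
        end = source.find(">", match.end())
        rest = source[end + 1:] if end != -1 else ""
        source = source[:match.start()] + rest
    # Default white-space handling collapses runs of spaces and newlines.
    return " ".join(source.split())


print(visible_text(ART))
```

Anything appended after line 6 disappears too, as long as it contains no ">", which matches the claim that the rest of the file is swallowed by the comment.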

xnx 914 days ago

Weird. I think it did change. Google cache shows a 2229 line file: https://webcache.googleusercontent.com/search?q=cache%3Ahttp...

capableweb 914 days ago

Seems it might be looking at the referrer. Loading https://www.shopify.com/robots.txt by clicking the link shows the weird line, while opening it in a private browser window shows the right one.

calebegg 914 days ago

For some reason, "view source" gets the right list. Maybe a referer issue like someone else said.

KomoD 913 days ago

Looks like it's just Shopify's own pages and not anything related to actual stores.

calebegg 914 days ago

It seems sort of questionable to use the list of things to not scrape as a starting point for scraping.... I mean, I get it's not actually enforced.

das_keyboard 914 days ago

Not really sure why all the answers here are flagged, but you may be mistaken.

The robots.txt does not exclusively list what not to scrape.

It provides information on which parts are allowed and which are not (disallowed).

It also provides sitemaps for crawlers as a starting point with more information (e.g. which pages are available, how often they are updated, etc.)

xnx 913 days ago

Since ~2009 many crawlers recognize "Sitemap:" directives in robots.txt to link to sitemaps: https://en.wikipedia.org/wiki/Robots.txt#Sitemap
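
For anyone who wants to pull those entries out by hand, here is a minimal sketch. It assumes the usual "Sitemap: <url>" line format with a case-insensitive field name and "#" comments; sitemap_urls and the example.com sample are made up for illustration, and a URL containing a "#" fragment would be cut short by the comment stripping.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Collect the URLs from "Sitemap:" lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split only on the first colon so the URL's own "://" survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content, not Shopify's actual file.
SAMPLE = """\
# robots.txt for example.com
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/blog/sitemap.xml
"""

print(sitemap_urls(SAMPLE))
```

Python 3.8+ also ships urllib.robotparser.RobotFileParser.site_maps(), which does the same job once the file has been fetched and parsed.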

patatero 914 days ago

Shopify shops always have /collections, /products, and /pages in their URLs. If you have a regular Shopify site, you're not allowed to change them. I don't know if Shopify Plus clients can change them.

Shopify sites also have shop-name.com/products.json which has URLs that point to cdn.shopify.com
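
A rough sketch of what working with that feed could look like. The payload shape used here (a top-level "products" list whose items carry an "images" list with "src" URLs) is an assumption about the feed, not something confirmed in this thread, and products_feed_url, cdn_image_urls and the sample data are invented for illustration. No network request is made.

```python
from urllib.parse import urlsplit


def products_feed_url(store: str) -> str:
    """Build the /products.json URL from a store domain or any store URL."""
    parts = urlsplit(store if "//" in store else f"https://{store}")
    return f"{parts.scheme}://{parts.netloc}/products.json"


def cdn_image_urls(payload: dict) -> list[str]:
    """Return image URLs in an already-parsed feed that point at cdn.shopify.com."""
    urls = []
    for product in payload.get("products", []):
        for image in product.get("images", []):
            src = image.get("src", "")
            if urlsplit(src).netloc == "cdn.shopify.com":
                urls.append(src)
    return urls


# Hypothetical payload in the assumed shape.
SAMPLE_FEED = {
    "products": [
        {
            "title": "Example mug",
            "handle": "example-mug",
            "images": [{"src": "https://cdn.shopify.com/example/mug.jpg"}],
        },
        {"title": "No pictures yet", "handle": "no-pictures", "images": []},
    ]
}

print(products_feed_url("shop-name.com"))
print(cdn_image_urls(SAMPLE_FEED))
```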