by Asparagirl 914 days ago
Cool! But how did you get the initial dataset of 643,000+ Shopify stores (data as per your “About” page) in the first place, to then scrape the products from their /products.json feeds? Or did you just try a huge list of domain names at random?
3 comments

I bought an initial list of 2m stores for a few hundred dollars from a website called "Built With". I think their lists are mostly used for building sales outreach lists. I then narrowed the focus to stores in the US only and between $100k - $1m in revenue to keep the initial data set manageable (and the CPU / storage costs reasonable).
> and between $100k - $1m in revenue

Does "Built With" provide that data? How accurate do you think it might be?

https://www.shopify.com/robots.txt lists a lot of sitemap files, which tend to be a good starting point.
Did this suddenly get changed? Nothing but "# ,: # ,' | # / : # --' / # \/ />/ # /" is shown now.
It's just your browser's HTML parser. Line 6:

  #                         / <//_\
This is interpreted as a malformed HTML closing tag, which (according to the HTML5 parsing algorithm published by WHATWG) gets treated as a comment. The file doesn't contain any > past this point, so the comment never closes and swallows the rest of the file. This leaves only the uncommented contents from lines 1–6:

  #                               ,:
  #                             ,' |
  #                            /   :
  #                         --'   /
  #                         \/ />/
  #                         /
Or, with whitespace collapsed:

  # ,: # ,' | # / : # --' / # \/ />/ # /
Which should be exactly what you observe.

Ref: https://html.spec.whatwg.org/multipage/parsing.html https://developer.mozilla.org/en-US/docs/Web/CSS/white-space...
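For anyone who wants to reproduce the effect without a browser, here is a rough Python sketch. It is not a real HTML tokenizer: it only models the one rule discussed above (a `</` followed by something other than an ASCII letter or `>` opens a comment that runs to the next `>` or to end of file) and then collapses whitespace. The function name `visible_text` and the constant `ROBOTS_HEAD` are made up for illustration, and the sample is just the six lines quoted above.

```python
import re

# The first six lines of the file, as quoted in the comment above.
ROBOTS_HEAD = "\n".join([
    "#                               ,:",
    "#                             ,' |",
    "#                            /   :",
    "#                         --'   /",
    "#                         \\/ />/",
    "#                         / <//_\\",
])


def visible_text(source: str) -> str:
    """Approximate what is left on screen after the comment is dropped.

    Simplification: handles only the first malformed closing tag and
    ignores every other HTML parsing rule.
    """
    # "</" followed by a character that is neither an ASCII letter nor ">"
    # is treated as the start of a comment.
    match = re.search(r"</[^A-Za-z>]", source)
    if match:
        # The comment ends at the next ">", or at end of file if there is none.
        end = source.find(">", match.start() + 2)
        tail = source[end + 1:] if end != -1 else ""
        source = source[:match.start()] + tail
    # Default CSS white-space handling collapses runs of whitespace.
    return " ".join(source.split())


print(visible_text(ROBOTS_HEAD))
```

Run on the six quoted lines, this prints the same collapsed string shown above. Note that the `>` on line 5 survives because it comes before the `</` and is therefore plain text.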

Weird. I think it did change. Google cache shows a 2229 line file: https://webcache.googleusercontent.com/search?q=cache%3Ahttp...
Seems it might be looking at the referrer. Loading https://www.shopify.com/robots.txt from clicking the link shows the weird line while opening it in a private browser window shows the right one.
For some reason, "view source" gets the right list. Maybe a referer issue like someone else said.
Looks like it's just Shopify's own pages and not anything related to actual stores.
It seems sort of questionable to use the list of things to not scrape as a starting point for scraping.... I mean, I get it's not actually enforced.
Not really sure why all the answers here are flagged, but you may be mistaken.

The robots.txt does not exclusively list what not to scrape.

It provides information on which parts are allowed and which are not (disallowed).

It also provides sitemaps for crawlers as a starting point with more information (e.g. which pages are available, how often they are updated, etc.)

Since ~2009 many crawlers recognize "Sitemap:" directives in robots.txt to link to sitemaps: https://en.wikipedia.org/wiki/Robots.txt#Sitemap
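To make that concrete, here is a minimal Python sketch of pulling the "Sitemap:" lines out of a robots.txt body. The function name `sitemap_urls` and the sample file are invented for illustration (the example.com URLs are placeholders, not Shopify's real entries), and the parsing is deliberately naive: it strips `#` comments and matches the field name case-insensitively.

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named by Sitemap: lines, in file order."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own "://" is kept.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
# example file
User-agent: *
Disallow: /admin
Allow: /

Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-pages.xml
"""

print(sitemap_urls(SAMPLE))
```

If you would rather not hand-roll it, Python's standard library has `urllib.robotparser.RobotFileParser`, whose `site_maps()` method (3.8+) returns the same kind of list after `parse()` or `read()`.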
Shopify shops always have /collections, /products, and /pages in their URL. If you have a regular Shopify site, you're not allowed to change them. I don't know if Shopify Plus clients can change them.

Shopify sites also expose shop-name.com/products.json, which contains URLs that point to cdn.shopify.com
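A small Python sketch of reading such a feed, with the caveat that the JSON shape used here (a top-level "products" list whose items carry an "images" list of objects with a "src" field) is my assumption about the format and not something confirmed in this thread. The function name `image_urls` and the sample payload, including its cdn.shopify.com paths, are made up for illustration. No network request is made; you would pass in the body you fetched yourself.

```python
import json


def image_urls(products_json: str) -> list[str]:
    """Collect image URLs from a products.json body (assumed shape)."""
    data = json.loads(products_json)
    urls = []
    for product in data.get("products", []):
        for image in product.get("images", []):
            src = image.get("src")
            if src:
                urls.append(src)
    return urls


# Hypothetical payload, for illustration only.
SAMPLE = json.dumps({
    "products": [
        {
            "title": "Example mug",
            "images": [
                {"src": "https://cdn.shopify.com/example/mug-front.jpg"},
                {"src": "https://cdn.shopify.com/example/mug-back.jpg"},
            ],
        },
        {"title": "No pictures yet", "images": []},
    ]
})

print(image_urls(SAMPLE))
```

The `.get(...)` calls with defaults are there so that a product with no images, or a response missing the expected keys, yields an empty result instead of raising.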