| HN Mirror


by wraptile 1487 days ago

There's actually a difference between crawling and web scraping. Crawling discovers pages loosely: it follows every link, digests each page, and produces more crawl tasks. Web scraping, on the other hand, is a much more controlled process with strict rules, e.g. scrape only `product-<product id>.html` links for product data, so a web scraper is very unlikely to stumble onto some random page.
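To make the distinction concrete, here's a rough sketch rather than anyone's real scraper. The function names, the sample HTML, the `example.com` host, and the assumption that product ids are numeric are all mine, purely for illustration. The "crawler" keeps every link it finds; the "scraper" keeps only links matching the product pattern.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Assumed pattern: product pages look like /product-<numeric id>.html
PRODUCT_URL = re.compile(r"/product-(\d+)\.html$")


def crawl_targets(base_url, html):
    """Crawler behaviour: every link on the page becomes a new task."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def scrape_targets(base_url, html):
    """Scraper behaviour: only links matching the strict rule are kept."""
    return [url for url in crawl_targets(base_url, html) if PRODUCT_URL.search(url)]


# Hypothetical page content, for illustration only.
SAMPLE_HTML = (
    '<a href="/about.html">About</a> '
    '<a href="/product-42.html">Widget</a> '
    '<a href="/blog/">Blog</a>'
)
```

With the sample page, `crawl_targets("https://example.com/", SAMPLE_HTML)` returns all three links, while `scrape_targets` returns only the product page.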

Also, unfortunately, robots.txt is rarely used these days to mark endpoints that shouldn't be crawled; instead it's used as a way to withhold public data. Just take any random big website and look at its robots.txt file:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /
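If you want to see how those rules play out, a quick check with Python's standard `urllib.robotparser` works. This is only a sketch: the `is_allowed` helper, the `MyScraper/1.0` user agent, and the `example.com` URL are made up for the example, and the rules are the ones quoted above.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and user agents, for illustration only.
googlebot_ok = is_allowed("Googlebot", "https://example.com/page")
scraper_ok = is_allowed("MyScraper/1.0", "https://example.com/page")
```

Here `googlebot_ok` is `True` and `scraper_ok` is `False`: the same public page is open to Googlebot and closed to everyone else.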