by JimDabell 334 days ago

> Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.

It’s definitely not crawling as robots.txt defines the term:

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

— https://www.robotstxt.org/orig.html

You will see that reflected in lots of software that respects robots.txt. For instance, if you fetch a single URL with wget, it won’t look at robots.txt. But if you mirror a site with wget, it fetches the initial URL, finds the links in that page, and then fetches and checks robots.txt before fetching any subsequent pages.
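
To make the distinction concrete, here is a rough Python sketch of that behaviour. It is not wget's actual implementation. `fetch_order`, `LinkParser`, the `example-bot` agent name, and the in-memory `site` dict (standing in for the network) are all made up for illustration; only `urllib.robotparser`, `urllib.parse`, and `html.parser` are real standard-library modules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_order(start_url, site, robots_txt, recursive, agent="example-bot"):
    """Return the URLs that would be fetched, in order.

    `site` is a hypothetical in-memory map of URL -> HTML (no network I/O).
    robots.txt is only consulted once links are being followed.
    """
    fetched = [start_url]  # the URL the user asked for is always fetched
    if not recursive:
        return fetched     # single fetch: robots.txt is never read

    # Recursive retrieval: load the rules before following any link.
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())

    queue = [start_url]
    while queue:
        url = queue.pop(0)
        parser = LinkParser()
        parser.feed(site.get(url, ""))
        for href in parser.links:
            target = urljoin(url, href)
            if target in fetched or target not in site:
                continue
            if not rules.can_fetch(agent, target):
                continue   # disallowed for recursive retrieval
            fetched.append(target)
            queue.append(target)
    return fetched


# Hypothetical two-page site whose /private/ tree is disallowed.
SITE = {
    "https://example.com/private/index.html":
        '<a href="a.html">a</a> <a href="/public/b.html">b</a>',
    "https://example.com/private/a.html": "",
    "https://example.com/public/b.html": "",
}
ROBOTS = "User-agent: *\nDisallow: /private/\n"
```

With `recursive=False`, the start URL is returned even though it sits under a disallowed path, because the user named it explicitly. With `recursive=True`, the linked `/private/a.html` is skipped and only `/public/b.html` is followed.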