|
|
|
|
|
by JimDabell
334 days ago
|
|
> Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term. It’s definitely not crawling as robots.txt defines the term.
: > WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. — https://www.robotstxt.org/orig.html You will see that reflected in lots of software that respects robots.txt. For instance, if you fetch a URL with wget, then it won’t look at robots.txt. But if you mirror a site with wget, then it will fetch the initial URL, then it will find the links in that page, then before fetching subsequent pages it will fetch and check robots.txt. |
|