by tommccabe
4230 days ago
Looks like I'm at one of the retailers you crawl. Recently, our site was getting hit by a web crawler that was following links incorrectly. I blacklisted several IP addresses from accessing the site, and now I wonder if it was this! Does your crawler obey robots.txt rules?

User-Agent: established_company
Allow: /some-stuff
User-Agent: *
Disallow: /
# keeps out filthy peasants
And you're either stuck following them, and not getting data that would be offered up for free if you were someone else, or being a bad person and ignoring them. You don't really see the services that follow the rules.
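For anyone who wants to see how rules like the ones above play out, here is a rough sketch using Python's standard-library `urllib.robotparser`. The agent name `some_new_crawler` and the `example.com` URL are made up for illustration, and other parsers may resolve edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, minus the comment line.
ROBOTS_TXT = """\
User-Agent: established_company
Allow: /some-stuff
User-Agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "some_new_crawler" is a hypothetical agent that falls under "*".
    for agent in ("established_company", "some_new_crawler"):
        print(agent, allowed(agent, "https://example.com/some-stuff"))
```

The named agent gets in and everyone else hits the catch-all `Disallow: /`, which is the whole complaint.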
Also, here's a good paper on how much being preferred in robots.txt helps, which makes you a better product, which makes you more preferred...
https://etda.libraries.psu.edu/paper/9230/4516