|
As someone who has to deal with a lot of bots, bot networks and other weird scraper apps people use: the biggest issue is that most of these tools are not very well behaved. This tool is clearly designed to circumvent protections against scraping, mostly rate limits, that might be essential to keep things running. They follow links that are explicitly marked as do not follow, they do not even try to limit their request rate, they spoof their user agent strings, etc. These bots cause real problems and cost real money. I do not think that kind of misuse is ethical. In fact, using this tool to circumvent protections can turn your scraping into a DDoS attack, which I do not feel is ethical. If your bot behaves itself though, public information is public imo. Just don't take down websites, respect rate limits and do not follow 'no-follow' links. To give an idea of the size of the issue, we have websites for customers that have maybe 5 hits per minute from actual users. Then _suddenly_ you go to 500 hits/minute for a couple of hours, because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever. (Not the greatest software that these links are still there tbh, but that is out of my control.) Or another situation, not entirely related, but interesting I think:
A few years back, for days on end, 80% of our total traffic came from random IPs across China, and requests could be traced through HTTP referrers where the 'user' had apparently opened a page in one province, then traveled to the other side of China and clicked a link 2 hours later. All these things are relatively easy to mitigate, but that doesn't make it ethical. |
The only thing that rel="nofollow" does is tell search engines not to use that link in their PageRank computation.
If you do want to block well-behaved crawlers from crawling parts of your site, the proper way to do that is to use robots.txt rules.
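For what it's worth, here is a minimal sketch of how a well-behaved crawler might honor such rules, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent and the example.com URLs are all made up for illustration. Also note that `Crawl-delay` is a non-standard extension that not every crawler respects.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the path and delay are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a made-up user agent; a real crawler should send an honest one.
allowed_home = parser.can_fetch("ExampleBot", "https://example.com/")
allowed_calendar = parser.can_fetch(
    "ExampleBot", "https://example.com/calendar/1850/01"
)

# Seconds to wait between requests, or None if no Crawl-delay rule applies.
delay = parser.crawl_delay("ExampleBot")
```

A crawler would call `can_fetch` before every request and sleep for `delay` seconds between requests to the same host. That would keep it out of the endless calendar pages and well under the kind of 500 hits/minute burst described above.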