| HN Mirror


by holstvoogd 1819 days ago

As someone who has to deal with a lot of bots, bot networks and other weird scraper apps people use: The biggest issue is that most of these tools are not very well behaved. This tool is clearly designed to circumvent anti-scraping protections, mostly rate limits, that might be essential to keep things running.

They follow links that are explicitly marked as do not follow, they do not even try to limit their rate, they spoof their user agent strings, etc. These bots cause real problems and cost real money. I do not think that kind of misuse is ethical. In fact, using this tool to circumvent protections can turn your scraping into a DDoS attack, which I do not feel is ethical.

If your bot behaves itself though, public information is public imo. Just don't take down websites, respect rate limits and do not follow 'no-follow' links.
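For what "respect rate limits" can look like on the scraper side, here is a minimal sketch of a per-host throttle. The class name `PoliteThrottle`, the 2-second default, and the injectable `clock`/`sleep` parameters are all assumptions made for illustration; they do not come from any particular scraping tool.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; the default delay is an assumption.
    """

    def __init__(self, min_delay_seconds=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return the seconds slept."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_delay - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the time after sleeping, i.e. when the request is actually sent.
        self._last_request[host] = self._clock()
        return delay
```

Calling `throttle.wait(url)` before every request keeps one host from being hit more often than once per `min_delay_seconds`, while requests to other hosts are not held up.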

To give an idea of the size of the issue, we have websites for customers that have maybe 5 hits per minute from actual users. Then _suddenly_ you go to 500 hits/minute for a couple of hours, because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever. (Not the greatest software that these links are still there tbh, but that is out of my control.)

Or another situation, not entirely related, but interesting I think: A few years back, for days on end, 80% of our total traffic came from random IPs across China, and requests could be traced through HTTP referrers where the 'user' had apparently opened a page in one province, then traveled to the other side of China and clicked a link 2 hours later.

All these things are relatively easy to mitigate, but that doesn't make it ethical.

3 comments

thekyle 1819 days ago

Just so you know, rel="nofollow" was never intended to prevent bots from following those links. Even famous crawlers like Bingbot will sometimes follow and index pages linked to by "nofollow" links.

The only thing that rel="nofollow" does is tell search engines not to use that link in their PageRank computation.

If you do want to block well-behaved crawlers from crawling parts of your site, the proper way to do that is to use robots.txt rules.
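As a rough sketch of what that looks like from the crawler's side, Python's standard `urllib.robotparser` can check a URL against robots.txt rules before fetching. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs below are made up for illustration. `Crawl-delay` is a non-standard directive that not every crawler honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Crawl-delay: 10
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# A well-behaved crawler asks before every fetch.
calendar_allowed = parser.can_fetch("ExampleBot", "https://example.com/calendar/1850/01")
about_allowed = parser.can_fetch("ExampleBot", "https://example.com/about")
delay_seconds = parser.crawl_delay("ExampleBot")
```

With these rules, the calendar URL is disallowed and the about page is allowed. None of this stops a badly behaved bot, since robots.txt is only a request.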


wu_187 1819 days ago

Exactly. We'd get customers whose sites would drown in bot traffic because they pulled the same shit, changing user agents, different IPs, etc. I had to build custom ModSecurity rules to block the patterns these boys would pull. What's funny is that the bots would have a site where you could control the crawl rates but it's just a placebo. They would crawl even if you requested them to stop.

The issue with bots hosted on AWS or any cloud for that matter is that as a web host you can't just block the IPs because legitimate traffic comes from them in the form of CMS plugins, backups, etc.


judge2020 1818 days ago

Note that you can report spam to AWS https://support.aws.amazon.com/#/contacts/report-abuse


twothamendment 1818 days ago

"... because some bot is trying to scrape a calendar and is now looking for events in 1850 or whatever."

Been there, done that - at least on the side of fixing it. Anyone who implements a calendar: don't make previous and next links that let someone travel through time forever.
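One way to do that, sketched under assumptions: only render a previous/next link when the target month falls inside a fixed window around today. The function names and the 24-month window are hypothetical; pick whatever range your events actually cover.

```python
from datetime import date


def add_months(month_start, delta):
    """Return the first day of the month `delta` months away from `month_start`."""
    index = month_start.year * 12 + (month_start.month - 1) + delta
    return date(index // 12, index % 12 + 1, 1)


def calendar_nav(current, today, window_months=24):
    """Return (prev, next) month starts, with None where no link should be rendered.

    The 24-month default is an assumption for illustration.
    """
    current = current.replace(day=1)
    anchor = today.replace(day=1)
    earliest = add_months(anchor, -window_months)
    latest = add_months(anchor, window_months)

    prev_month = add_months(current, -1)
    next_month = add_months(current, 1)

    # Only link to months inside the window, so a crawler runs out of links to follow.
    prev_link = prev_month if earliest <= prev_month <= latest else None
    next_link = next_month if earliest <= next_month <= latest else None
    return prev_link, next_link
```

A page for a month far outside the window (say 1850) gets no navigation links at all, so a bot that lands there has nowhere further to go.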

I've always wondered how much bot traffic costs us - but never actually tried to figure it out. It is a good portion of our traffic - even when we block a lot of it.
