Hacker News
by armchairhacker 460 days ago
It's (usually) not about restricting access. Many AI scrapers are poorly implemented: they overload servers with requests, effectively DoS-ing the server and preventing anyone (including normal users and even the scrapers themselves) from accessing it.

Many of the AI scrapers' requests also point to non-existent, redundant, or low-quality destinations. Websites provide a file, /robots.txt, that clearly indicates which URLs crawlers should and should not visit; but the AI scrapers ignore robots.txt, visiting any URL they find, and some they invent (which, naturally, turn out to be non-existent). Websites also indicate when the content at a specific URL has changed or may change; but AI scrapers ignore those indicators too, sometimes requesting the same URL for a static webpage seconds apart.
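
For anyone wondering what "respecting those signals" looks like in practice, here is a rough Python sketch, not a full crawler. The robots.txt content, the ExampleBot user agent, and the example.com URLs are made up for illustration. I'm also assuming the "change indicators" are the usual HTTP validators (ETag and Last-Modified), which a client echoes back as If-None-Match and If-Modified-Since so the server can answer 304 Not Modified instead of resending the page. Crawl-delay is a non-standard directive that Python's parser happens to understand.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
Crawl-delay: 10
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def conditional_headers(etag=None, last_modified=None) -> dict:
    """Headers for a revalidation request, built from validators saved
    from an earlier response. The server may reply 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

parser = make_parser(ROBOTS_TXT)

# Check each URL before requesting it ("ExampleBot" is a made-up user agent).
allowed = parser.can_fetch("ExampleBot", "https://example.com/about")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data")

# Seconds to wait between requests, if the site asks for a delay.
delay = parser.crawl_delay("ExampleBot")
```

A scraper that did just these three things (check can_fetch, wait out the delay, revalidate instead of refetching) would already avoid most of the behavior described above.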

Cloudflare's post (https://blog.cloudflare.com/ai-labyrinth/) says AI Labyrinth works against "unauthorized crawling" and "inappropriate bot activity". I assume that a scraper (even an AI scraper) that respects robots.txt and doesn't send requests at unreasonable rates won't encounter AI Labyrinth.

1 comment

Right. I don't care what scrapes my site, but I get tired of getting tens of thousands of hits in an hour, coming from an array of IPs at AWS or other cloud providers so they can all hammer my little server at once. I block them at the firewall when it happens so it doesn't take down the web server, but it's tiresome to have to do that.
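
One way to spot that kind of burst automatically is sketched below in Python. This is only an illustration, not what the commenter runs: the BurstDetector class, the /24 grouping, and the thresholds are all assumptions. Because the traffic comes from many addresses at once, the sketch counts hits per network prefix rather than per single IP (IPv4 only), using a sliding time window. Actually blocking a flagged network is left to whatever firewall is in use.

```python
import ipaddress
from collections import defaultdict, deque

class BurstDetector:
    """Flag networks that exceed `limit` hits within `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float, prefix_len: int = 24):
        self.limit = limit
        self.window = window_seconds
        self.prefix_len = prefix_len
        self.hits = defaultdict(deque)  # network -> timestamps of recent hits

    def _network(self, ip: str) -> str:
        # Collapse an address to its enclosing network, e.g. 203.0.113.0/24.
        return str(ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False))

    def record(self, ip: str, now: float) -> bool:
        """Record one hit; return True if the network is over the limit."""
        recent = self.hits[self._network(ip)]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and recent[0] <= now - self.window:
            recent.popleft()
        return len(recent) > self.limit

# Usage: feed it (ip, timestamp) pairs parsed from an access log.
detector = BurstDetector(limit=3, window_seconds=60)
flags = [detector.record(f"203.0.113.{i}", now=float(i)) for i in range(1, 5)]
```

Here the fourth hit from the same /24 inside a minute trips the limit even though every request came from a different address.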