

by mtmail 460 days ago

Website owners create a robots.txt file to guide crawlers/scrapers. Some URLs might be slow to generate (lots of data/database access), some are irrelevant to any search, and some might be outdated or duplicates. Others might be under copyright or some other usage restriction.
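For anyone who hasn't looked at one, here is a rough sketch of how a well-behaved crawler could check robots.txt before fetching, using Python's standard `urllib.robotparser`. The rules, the paths (`/search`, `/archive/`), the `ExampleBot` user agent and the example.com URLs are all made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /archive/
Crawl-delay: 1
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a placeholder user agent name.
print(parser.can_fetch("ExampleBot", "https://example.com/search?q=x"))
print(parser.can_fetch("ExampleBot", "https://example.com/about"))
print(parser.crawl_delay("ExampleBot"))
```

A real crawler would download the file from the site (for example with `set_url()` and `read()`) instead of using a hard-coded string. Note that `Crawl-delay` is a non-standard directive that only some crawlers honor.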

Now we see crawlers ignoring robots.txt.

Some crawlers don't make 1 request per second but hit a website with 100 per second, and keep it up for days, crawling the same data again and again. It makes websites slow and has no immediate benefit to humans.
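To make the contrast concrete, here is a minimal sketch of the polite behavior: at most one request per second per host, and no refetching of a URL that was already crawled. The class name `PoliteScheduler`, its methods and the 1-second default are assumptions for illustration, not how any particular crawler is built.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical helper: per-host rate limit plus a seen-URL set."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests to one host
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._next_allowed = {}           # host -> earliest time of next request
        self._seen = set()                # URLs already scheduled once

    def should_fetch(self, url):
        """Return True only the first time a URL is offered."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

    def wait_turn(self, url):
        """Sleep until the URL's host may be contacted; return the delay used."""
        host = urlsplit(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Reserve the next slot one interval after this request goes out.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

A crawl loop would call `should_fetch(url)` first and skip the URL on `False`, then call `wait_turn(url)` right before the HTTP request. A real crawler would also expire the seen set after some time so pages can be refreshed occasionally, and would prefer the site's own crawl delay over a fixed default.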