by adrianoconnor 4931 days ago
Do you have a robots.txt? That's the standard way.
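For anyone who hasn't looked at how a well-behaved bot uses robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `examplebot` user agent, the URLs, and the `is_allowed` helper are made up for illustration; only crawlers that bother to run a check like this are affected by the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a site might publish at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file contents as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/data"):
        print(url, is_allowed(ROBOTS_TXT, "examplebot", url))
```

Nothing on the server enforces this; the check runs entirely on the crawler's side.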
5 comments

Calling them "scrapers" implies they are doing something nefarious (stealing content). Robots.txt is for law-abiding bots.
Not really. "Scraping" just refers to extracting data from a site using an automated method; it doesn't carry any connotations about the motivation or acceptability of the process.
robots.txt can be ignored; it's just a reference for honest spiders. I think the way described above (listing top requestors, doing statistics, and then automating blocking) is indeed the best way. There could also be a blocklist or two of malicious scrapers around. And if there isn't, that's a new business proposal.
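A rough sketch of the "list top requestors, then block" idea, assuming an access log where the client IP is the first whitespace-separated field (as in the common log format). The function names and the threshold are hypothetical, and the actual blocking step (firewall rule, web server deny list) is left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_requests(log_lines: Iterable[str]) -> Counter:
    """Count requests per client IP, taking the IP to be the first field."""
    counts: Counter = Counter()
    for line in log_lines:
        fields = line.split(maxsplit=1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def heavy_hitters(counts: Counter, threshold: int) -> List[Tuple[str, int]]:
    """Return (ip, count) pairs with at least `threshold` requests, busiest first."""
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    # Assumed log location; adjust for your own server.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for ip, n in heavy_hitters(count_requests(log), threshold=1000):
            print(ip, n)
```

A raw request count is a crude signal. Legitimate crawlers and users behind a shared proxy can also rank high, so the list is better treated as candidates for review than as an automatic ban list.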
That is the way to block spiders that obey the standards, but enough of them do not that robots.txt is not a solution.
You'd end up blocking all traffic then. When was the last time you pulled robots.txt?
Someone scraping your site may not respect robots.txt.