	by adrianoconnor 4978 days ago
	Do you have a robots.txt? That's the standard way.

5 comments

dumbfounder 4978 days ago

Calling them "scrapers" implies they are doing something nefarious (stealing content). Robots.txt is for law-abiding bots.
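To illustrate why robots.txt only affects cooperative bots, here is a minimal sketch of the check a well-behaved crawler performs on its own side before fetching a page. It uses Python's standard `urllib.robotparser`; the rules, the `ExampleBot` user agent, and the example.com URLs are assumptions made up for illustration and do not come from the thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The crawler itself decides whether to honor the answer.
# Nothing on the server enforces it.
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/page")
allowed = parser.can_fetch("ExampleBot", "https://example.com/public/page")
print(blocked, allowed)
```

A bot that never runs a check like this is unaffected by the file, which is the point being made here.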

ig1 4978 days ago

Not really. "Scraping" just refers to extracting data from a site using an automated method; it doesn't carry any connotations about the motivation or acceptability of the process.

rli 4978 days ago

robots.txt can be ignored; it's just a reference for honest spiders. I think the approach described above (listing top requestors, doing statistics, and then automating the blocking) is indeed the best way. There could also be a blocklist or two of malicious scrapers around. And if there isn't, that's a new business proposal.
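The "list top requestors, then block" idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the log lines are assumed to start with the client address (as in the common access log format), and the function names, sample data, and threshold are hypothetical rather than anything specified in the thread.

```python
from collections import Counter


def count_requests(log_lines):
    """Count requests per client address.

    Assumes the address is the first whitespace-separated field of each line.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts


def addresses_to_block(log_lines, threshold):
    """Return, in sorted order, the addresses with more than `threshold` requests."""
    counts = count_requests(log_lines)
    return sorted(addr for addr, hits in counts.items() if hits > threshold)


# Made-up sample log using documentation-only IP ranges.
sample_log = [
    '203.0.113.5 - - [01/Jan/2012:00:00:01 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2012:00:00:02 +0000] "GET /b HTTP/1.1" 200 512',
    '198.51.100.7 - - [01/Jan/2012:00:00:03 +0000] "GET /a HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2012:00:00:04 +0000] "GET /c HTTP/1.1" 200 512',
    '203.0.113.5 - - [01/Jan/2012:00:00:05 +0000] "GET /d HTTP/1.1" 200 512',
]

top = count_requests(sample_log).most_common(1)
blocked = addresses_to_block(sample_log, threshold=3)
print(top, blocked)
```

In practice the threshold would need tuning per site, and the resulting list would be fed to whatever blocking mechanism is in use (firewall rules, web server deny lists); that part is left out because it depends on the setup.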

lifeguard 4978 days ago

That is the way to block spiders that obey the standards, but enough of them do not that robots.txt is not a solution.

bdcravens 4978 days ago

You'd end up blocking all traffic then. When was the last time you pulled robots.txt?

halis 4978 days ago

Someone scraping your site may not respect robots.txt.
