by snowcode 1445 days ago
Agreed, and it's the reason for this post. I have a Cloudflare Worker (code I've written) that tracks requests for robots.txt, then tracks whether the requestor's session honours the robots allow and disallow rules, and forces the correct behaviour: when a robot requests a file from a disallowed path, my worker returns 401. The problem is that I can't track bot behaviour when good bots "continue" their session after requesting robots.txt (with a brief pause), because by then they mostly (according to my logs) make subsequent requests using IP addresses (or agents) from a pool, so the requests come in from multiple IP addresses.

I.e. there's no way to manage or track bot sessions, and thus no way to monitor whether they are adhering to the robots.txt allow/disallow rules.
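A minimal sketch of that tracking logic and why it breaks, written in Python as a stand-in rather than actual Worker code. The in-memory `sessions` store keyed by IP address, the sample robots.txt, and the `handle` function are all assumptions made for illustration; only the "return 401 for a disallowed path" behaviour comes from the description above.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only (not the real site's robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical session store: IPs that have fetched robots.txt.
sessions = set()


def handle(ip, user_agent, path):
    """Return the HTTP status this sketch would send for one request."""
    if path == "/robots.txt":
        sessions.add(ip)  # the "session" starts here, keyed by IP
        return 200
    # Enforce the rules only for clients we are tracking.
    if ip in sessions and not parser.can_fetch(user_agent, path):
        return 401
    return 200
```

If the bot fetches robots.txt from one address and then requests `/private/data` from another address in its pool, the second request is not matched to the session and slips through, which is the gap described above.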

For now, the only way for me to remove bad bots from my site is to stick exclusively to bots that I can get a reliable set of IP addresses for (at this juncture that's roughly just Google's and Bing's bots), and then ignore the requests for robots.txt.
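Roughly, that allowlist approach could look like the sketch below. The network ranges are placeholders from the reserved documentation blocks, not real crawler ranges (the real ones would have to come from whatever each search engine publishes), and the names `TRUSTED_BOT_NETWORKS` and `trusted_bot` are made up for the example.

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges (reserved documentation blocks), NOT real crawler ranges.
TRUSTED_BOT_NETWORKS = {
    "bot-a": [ip_network("192.0.2.0/24")],
    "bot-b": [ip_network("198.51.100.0/24")],
}


def trusted_bot(ip):
    """Return the name of the trusted bot that owns this IP, or None."""
    addr = ip_address(ip)
    for name, networks in TRUSTED_BOT_NETWORKS.items():
        if any(addr in net for net in networks):
            return name
    return None
```

Any request that claims to be a bot but comes from an address outside these ranges can then be refused without looking at robots.txt behaviour at all.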

My suggestion (this ycombinator post) is to have a reliable way of tracking robots that's easy for low-tech (amateur) bot builders to adhere to, and thus to block bad bot behaviour very efficiently without punishing good bots, and to let site owners be fully in control of paying for, and serving traffic to, only the visitors they want on their sites.

Update: forgot to say, my code also blocks requests from bots that don't read robots.txt first. The tricky part of course is not accidentally tagging a real user, or a user using some genuine accessibility assistance tool, as a bot. Someone using curl, for example: I don't mind a false positive and blocking their request, because if I wanted to make parts of my site available programmatically I would create an API for those bits, which currently is nothing.
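The "robots.txt first" rule can be sketched as a small gate. How a request gets classified as automated is the hard part and is deliberately left out: `looks_automated` is a hypothetical flag supplied by some upstream check, and the 403 status is an assumption, since the status used for this case isn't stated.

```python
# Hypothetical store of client keys (e.g. IPs) that have fetched robots.txt.
seen_robots = set()


def gate(client_key, path, looks_automated):
    """Return the HTTP status for a request under the robots-first rule."""
    if path == "/robots.txt":
        seen_robots.add(client_key)
        return 200
    # Real users never fetch robots.txt, so only automated clients are gated.
    if looks_automated and client_key not in seen_robots:
        return 403  # assumed status for "did not read robots.txt first"
    return 200
```

A curl user would land in the `looks_automated` branch and get blocked, which is the accepted trade-off mentioned above.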


Another update: my suggestion of privately sending a GUID to the bots you want to "allow" to scrape your site would make it trivial to ban bad bots. No GUID, no entry.
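As a rough illustration of "no GUID, no entry": the site keeps a table of GUIDs it has handed out privately and refuses any bot request that doesn't present one. The `X-Bot-Guid` header name, the `ISSUED_GUIDS` table, and the example GUID are all invented for this sketch; no such header is standardised.

```python
import hmac

# Hypothetical table of privately issued GUIDs, keyed by bot name.
ISSUED_GUIDS = {
    "friendly-crawler": "12345678-1234-5678-1234-567812345678",
}


def admit(headers):
    """Return the bot name whose GUID was presented, or None to refuse."""
    presented = headers.get("X-Bot-Guid", "").encode("utf-8")
    for name, guid in ISSUED_GUIDS.items():
        # Constant-time comparison so the check doesn't leak GUID contents.
        if hmac.compare_digest(presented, guid.encode("utf-8")):
            return name
    return None
```

Because the GUID travels with every request, the check no longer depends on which IP address from the bot's pool the request arrives from.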

Bots that flood your site, effectively DDoSing it, are no different from hackers trying to DDoS your site and would be dealt with as an illegal hacking attempt, not as a "legal" bot configuration issue.

It instantly puts all traffic into 2 buckets, legal and illegal, versus 3 buckets of legal, bad bot and illegal.

Tools like "scrapeshield" et al. already exist, so my suggestion would allow those vendors to make their scrapeshields even more focused, and allow me to build my poor man's version of those tools, since I don't want to pay for anything. Even $5 a month, or $5 a year, per domain for some anti-scraping is a deal breaker for me, since I'm managing hundreds of domains.