A better robots.txt standard; here's my suggestion
3 points by snowcode 1450 days ago
We need a better robots.txt standard that tech companies (and bot writers) should be required to adhere to. Here's my simple suggestion.

Stopping a huge amount of bad bot behaviour could surely be a really simple thing to do. I believe it is, and here's how:

* Bots should have to start a session by requesting robots.txt before any other interaction with a website. The robots.txt response should optionally include a correlation ID / session ID valid for the duration of the crawl, so that subsequent requests arriving from new IP addresses can be correlated back to the same bot session.

This would dramatically assist in telling good "bot" behaviour from bad, especially since "good" (polite) bot behaviour is, by its nature, hard to monitor: requests may come in a few seconds apart, and crawls may be paused and resumed over time to deliberately avoid swamping servers.

The result is that really good bots are hard to tag as good bots.
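A rough sketch of the session idea, not a spec. The `Session-ID` line in robots.txt and the `sessionId` field (standing in for whatever header the bot would echo back) are names I've made up for illustration; nothing like them exists in robots.txt today.

```typescript
// Hypothetical: the site hands out a session ID in robots.txt and the bot
// echoes it back (e.g. in a request header) on every later request.
interface BotRequest {
  ip: string;
  path: string;
  sessionId?: string; // echoed session ID, if the bot sent one
}

// Append a made-up "Session-ID" line to an existing robots.txt body.
function robotsWithSession(robotsBody: string, sessionId: string): string {
  const trimmed = robotsBody.replace(/\s+$/, "");
  return trimmed + "\nSession-ID: " + sessionId + "\n";
}

// Group requests by session ID so one crawl can be followed across IPs.
function groupBySession(requests: BotRequest[]): Map<string, BotRequest[]> {
  const sessions = new Map<string, BotRequest[]>();
  for (const req of requests) {
    if (req.sessionId === undefined) continue; // traffic we cannot correlate
    let seen = sessions.get(req.sessionId);
    if (seen === undefined) {
      seen = [];
      sessions.set(req.sessionId, seen);
    }
    seen.push(req);
  }
  return sessions;
}
```

With this, two requests from different IP addresses that carry the same session ID end up in the same group, which is exactly what IP-based tracking can't give you.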

* All bots should also have a standardised API (advertised in their request headers), hosted on a well-known branded domain belonging to the search engine / bot / service. A site owner can call that API to submit a token (a GUID) that the bot is then required to present during all future crawls.

Since crawls would be done over HTTPS, this would be a simple enough mechanism for identifying bot impersonators, without requiring the bots to make requests only from a known DNS domain.

The token would become effective within a reasonable period of, say, 3 to 5 minutes, i.e. enough time for any distributed cache to be updated. A bot should not run a crawl session lasting longer than that period, or if it does, it should re-check the token-is-required cache at least every 3 to 5 minutes. Whatever is deemed practical at today's cloud scale.
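To make the timing concrete, here's a minimal sketch of what the bot operator's side might look like. `TokenRegistry`, `GRACE_MS` and the method names are all invented for illustration, and the 5 minute value is just the upper end of the range above.

```typescript
// Hypothetical grace period before a newly submitted token is enforced.
const GRACE_MS = 5 * 60 * 1000;

interface TokenEntry {
  token: string;
  effectiveAt: number; // epoch milliseconds
}

// Hypothetical registry kept by the bot operator: one token per site.
class TokenRegistry {
  private entries = new Map<string, TokenEntry>();

  // Called when a site owner submits a token through the bot's callback API.
  register(site: string, token: string, nowMs: number): void {
    this.entries.set(site, { token: token, effectiveAt: nowMs + GRACE_MS });
  }

  // Token the bot must present to `site` at `nowMs`,
  // or undefined if no token is in force yet.
  requiredToken(site: string, nowMs: number): string | undefined {
    const entry = this.entries.get(site);
    if (entry === undefined || nowMs < entry.effectiveAt) return undefined;
    return entry.token;
  }
}
```

A crawler that re-checks `requiredToken` at least once per grace period would never be caught out by a token it hadn't heard about yet.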

What do you think?

P.S. My thoughts are based on my experience of code I've written to run on the cheap on edge computing ("Cloudflare Workers"), which presents an interesting challenge of how to do this without access to distributed caching etc. (Yes, distributed caches do exist at the edge, but they can't be used without increasing latency, or without the costs becoming worse than the traffic you're trying to block.) So it's just about what can easily be done on the cheap: blocking, say, 80% of bad traffic for 1% of the effort. My code doesn't have to be perfect, just effective.

2 comments

Keep in mind bots don’t always honour robots.txt

If it’s public, and you declare certain content not to be scraped, it will still be scraped in some cases. So don’t rely on it.

It’s a guideline only.

Agreed, and it's the reason for this post. I have a Cloudflare worker (code I've written) that tracks requests for robots.txt, then tracks whether the requester's session honours the robots.txt allow and disallow rules, and forces the correct behaviour: for robots that request files from a disallowed path, my worker returns 401. The problem is that I can't track bot behaviour when good bots "continue" their session after requesting robots.txt (with a brief pause), because by then they mostly (according to my logs) make subsequent requests using IP addresses (or agents) from a pool, resulting in requests coming in from multiple IP addresses.

I.e. there is no way to manage or track bot sessions, and thus no way to monitor whether they are adhering to the robots.txt allow/disallow rules.
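To show the gap, here's a stripped-down illustration (not my actual worker code; the paths, names and in-memory `Set` are simplifications for the example) of enforcing disallow rules when the IP address is the only session key you have:

```typescript
// Disallowed path prefixes, as they would be listed in robots.txt (example values).
const DISALLOWED = ["/private/", "/admin/"];

// IPs that have fetched robots.txt. The IP is the only session key available.
const sessionsByIp = new Set<string>();

// Returns the HTTP status this simplified worker would send for a request.
function statusFor(ip: string, path: string): number {
  if (path === "/robots.txt") {
    sessionsByIp.add(ip); // a bot session starts here
    return 200;
  }
  const isKnownBot = sessionsByIp.has(ip);
  const isDisallowed = DISALLOWED.some((prefix) => path.startsWith(prefix));
  if (isKnownBot && isDisallowed) return 401; // bot ignoring robots.txt
  return 200;
}

statusFor("203.0.113.1", "/robots.txt"); // 200, session recorded for this IP
statusFor("203.0.113.1", "/private/a");  // 401, same IP breaks the rules
statusFor("203.0.113.2", "/private/a");  // 200, same bot on a pooled IP slips through
```

The last call is the whole problem: once the bot moves to another address from its pool, the worker has no idea it is the same session.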

For now, the only way for me to remove bad bots from my site is to stick exclusively to bots that I can get a reliable set of IP addresses for (at this juncture that's roughly just Google's and Bing's bots), and then ignore the requests for robots.txt.

My suggestion (this ycombinator post) is to have a reliable way of tracking robots that's easy for low-tech (amateur) bot builders to adhere to. That would block bad bot behaviour very efficiently without punishing good bots, and let site owners be fully in control of paying for, and serving traffic to, only the visitors they want on their sites.

Update: forgot to say, my code also blocks requests from bots that don't read robots.txt first. The tricky part, of course, is not accidentally tagging a real user, or a user using some genuine accessibility tool, as a bot. Take someone using curl, for example: I don't mind a false positive that blocks his/her request, because if I wanted to make parts of my site available programmatically I would create an API for those parts, which currently is nothing.

Another update: my suggestion of privately sending a GUID to the bots you want to "allow" to scrape your site would make it trivial to ban bad bots. No GUID, no entry.

Bots that flood your site, effectively DDoSing it, are no different to hackers trying to DDoS your site, and would be dealt with as an illegal hacking attempt, not as a "legal" bot configuration issue.

It instantly puts all traffic into 2 buckets, legal and illegal, versus 3 buckets of legal, bad bot and illegal.

Tools like "ScrapeShield" et al. already exist, so my suggestion would allow those vendors to make their tools even more focused, and allow me to build my poor man's version of those tools, since I don't want to pay for anything. Even $5 a month, or $5 a year, per domain for some anti-scraping is a deal breaker for me, since I'm managing hundreds of domains.

More thoughts: invert allow/disallow

Robots.txt should be inverted, with everything disallowed by default and only allowed paths configured. Requiring site owners to explicitly list disallowed secret folders is ludicrous. How this became the de facto standard is hard to fathom. If a site owner wants to allow everything, then he/she should have to explicitly add a global allow.
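The inverted check is about as simple as it gets. A minimal sketch, assuming plain prefix matching (`isAllowed` and the example paths are made up, and real robots.txt matching has wildcards that this ignores):

```typescript
// Default deny: a path may be crawled only if some allowed prefix matches it.
function isAllowed(path: string, allowPrefixes: string[]): boolean {
  return allowPrefixes.some((prefix) => path.startsWith(prefix));
}

isAllowed("/blog/post-1", ["/blog/"]); // true, explicitly allowed
isAllowed("/secret/keys", ["/blog/"]); // false, never listed so never crawled
isAllowed("/anything", ["/"]);         // true, the explicit global allow
```

Note that the secret folder never has to be named anywhere, which is the whole point.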