| We need a better robots.txt standard, one that tech companies (and bot writers) should be required to adhere to. Here's my simple suggestion. Surely stopping a huge amount of bad bot behaviour could be a really simple thing to do? I believe it is, and here's how:
* Bots should have to start a session by requesting robots.txt before any other interaction with a website. That robots.txt response should optionally include a correlationID / sessionID valid for the duration of the bot's scraping, so that subsequent requests arriving from new IP addresses can be correlated back to the same bot session. This would dramatically assist in telling good bot behaviour from bad. "Good" (polite) bot behaviour is, by definition, hard to monitor: requests may arrive a few seconds apart, and crawls are paused and resumed over time to deliberately avoid swamping or overworking servers. The result is that really good bots are hard to tag as good bots.
* All bots should also have a standardised API (advertised in their request headers) hosted on a well-known, branded domain belonging to the search engine / bot / service. You make a callback to that API to submit a token GUID, which the bot can then be required to present during all future crawls. Since crawls would be done over HTTPS, this would be a simple enough mechanism to identify bot impersonators without requiring bots to make requests only from a known DNS domain. A token would become effective within a reasonable period of, say, 3 to 5 minutes, i.e. enough time for any distributed cache to be updated. A bot should not start a crawl session lasting longer than that same period or, if it does, it should re-check the token-is-required cache at intervals no longer than the same 3 to 5 minutes; whatever is deemed practical at today's cloud scale.
What do you think?
P.S. My thoughts are based on my experience of code I've written to run on the cheap on edge computing (Cloudflare Workers), which presents an interesting challenge: how to do this without access to distributed caching etc. (Yes, distributed caches do exist at the edge, but they can't be used without increasing latency, or without costs being worse than the traffic you're trying to block.) So it's just about what can easily be done on the cheap, to block say 80% of bad traffic for like 1% of the effort. My code doesn't have to be perfect, just effective. |
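For what it's worth, the session ID idea in the post can probably be done without any distributed cache if the ID is signed rather than stored. Below is a minimal Python sketch of that reading; it is not the poster's code, and nothing in it is part of any robots.txt standard. The `# Session-ID:` comment line, `SECRET_KEY`, `SESSION_TTL_SECONDS`, and all function names are hypothetical, made up for illustration. The server signs a timestamp plus a random nonce with HMAC and hands it out in robots.txt; any edge node holding the same secret can later verify the ID without looking anything up.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

# Hypothetical values: a real deployment would load the secret from config.
SECRET_KEY = b"replace-with-a-real-secret"
SESSION_TTL_SECONDS = 3600  # assumed session lifetime, not from the proposal


def _sign(payload: str) -> str:
    """Return the hex HMAC-SHA256 signature of the payload."""
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_session_id(now: Optional[float] = None) -> str:
    """Build a signed session ID of the form '<issued_at>.<nonce>.<signature>'."""
    issued_at = int(time.time() if now is None else now)
    payload = f"{issued_at}.{secrets.token_hex(8)}"
    return f"{payload}.{_sign(payload)}"


def verify_session_id(session_id: str, now: Optional[float] = None) -> bool:
    """Check signature and age without consulting any shared store."""
    parts = session_id.split(".")
    if len(parts) != 3:
        return False
    issued_at_text, nonce, signature = parts
    expected = _sign(f"{issued_at_text}.{nonce}")
    # Constant-time comparison on bytes to avoid timing leaks.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    try:
        issued_at = int(issued_at_text)
    except ValueError:
        return False
    current = time.time() if now is None else now
    return 0 <= current - issued_at <= SESSION_TTL_SECONDS


def robots_txt_with_session(base_rules: str, session_id: str) -> str:
    """Append the ID as a comment line, which existing parsers ignore."""
    return f"{base_rules.rstrip()}\n# Session-ID: {session_id}\n"
```

A bot would read the ID from robots.txt and send it back on every later request (in a header, say), so requests from different IP addresses can be tied to one session. This only proves the ID was issued by the site; it says nothing about who the bot is, which is what the token callback in the second bullet is for.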
If it’s public, and you declare certain content not to be scraped, it will still be scraped in some cases. So don’t rely on robots.txt.
It’s a guideline only.