by natebc
410 days ago
A few? We are routinely fighting off hundreds of bots at any moment. Thousands and thousands per day, easily. US, China, Brazil, from hundreds of different IPs, with dozens of different (and falsified!) user agents, all ignoring robots.txt and pushing over services that are needed by human beings trying to get work done.
EDIT: Just checked our Anubis stats for the last 24h:
CHALLENGE: 829,586
DENY: 621,462
ALLOW: 96,810
This is with a pretty aggressive "DENY" rule for a lot of the AI-related bots, and on 2 pretty small sites at $JOB. We have hundreds, if not thousands, of different sites that aren't protected by Anubis (yet). Anubis and efforts like it are a xesend for companies that don't want to pay off Cloudflare or some other "security" company peddling a WAF.
|
One is, suppose there are a thousand search engine bots. Then what you want is some standard facility to say "please give me a list of every resource on this site that has changed since <timestamp>" so they can each get a diff from the last time they crawled your site. Uploading each resource on the site to each of a thousand bots once is going to be irrelevant to a site serving millions of users (because it's a trivial percentage) and to a site with a small amount of content (because it's a small absolute number), which together constitute the vast majority of all sites.
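To make the "changed since <timestamp>" idea concrete, here is a minimal sketch in Python. The `RESOURCES` index, its paths and dates, and the `changed_since` helper are all hypothetical names made up for illustration, not an existing standard or API. (The Sitemaps protocol's optional `<lastmod>` field carries similar information, though as a full listing rather than a diff.)

```python
from datetime import datetime, timezone

# Hypothetical in-memory index: path -> last-modified time (UTC).
# A real site would derive this from its CMS or filesystem.
RESOURCES = {
    "/index.html": datetime(2024, 1, 10, tzinfo=timezone.utc),
    "/about.html": datetime(2023, 6, 1, tzinfo=timezone.utc),
    "/blog/post-42.html": datetime(2024, 3, 5, tzinfo=timezone.utc),
}


def changed_since(resources, since):
    """Return the paths modified strictly after `since`, oldest change first."""
    changed = [(mtime, path) for path, mtime in resources.items() if mtime > since]
    return [path for _mtime, path in sorted(changed)]


# A crawler that last visited on 2024-01-01 only needs two of the three pages.
last_crawl = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(changed_since(RESOURCES, last_crawl))
```

With something like this, each bot's repeat visit costs roughly the size of the diff rather than the size of the site.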
The other is, there are aggressive bots that will try to scrape your entire site five times a day even if nothing has changed and ignore robots.txt. But then you set traps, like disallowing something in robots.txt and then banning anything that tries to access it, which doesn't affect legitimate search engine crawlers because they respect robots.txt.
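The robots.txt trap can be sketched roughly like this in Python. The `/trap/` path, the `TrapBanList` class, and banning by client IP are all assumptions for illustration; a real setup would hook this into the web server or firewall, and might key bans on something other than the IP.

```python
# Hypothetical trap path. It is listed under Disallow, so well-behaved
# crawlers never request it; only clients ignoring robots.txt will.
TRAP_PREFIX = "/trap/"

ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


class TrapBanList:
    """Bans any client that requests a path under the trap prefix."""

    def __init__(self, trap_prefix=TRAP_PREFIX):
        self.trap_prefix = trap_prefix
        self.banned = set()

    def allow(self, client_ip, path):
        """Return True if the request may proceed, False if it is rejected."""
        if client_ip in self.banned:
            return False
        if path.startswith(self.trap_prefix):
            # Touching the trap bans the client for all later requests.
            self.banned.add(client_ip)
            return False
        return True


guard = TrapBanList()
print(guard.allow("203.0.113.7", "/index.html"))   # normal request
print(guard.allow("203.0.113.7", "/trap/page-1"))  # hits the trap
print(guard.allow("203.0.113.7", "/index.html"))   # now banned
```

A crawler that honors the `Disallow` line never triggers the ban, which is the whole point of the trap.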