by deepstack 2019 days ago
Any insight as to what is considered a "badly behaved crawler"? Or is it something that you work out with CDNs so they don't blacklist your IP?

Hello, I work on the technical side of Mojeek.

Mojeek follows the robots.txt protocol so if a site doesn't want to be crawled by MojeekBot we respect that wish. There is also a generous crawl delay between pages on the same host.
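For anyone wondering what those two rules look like in practice, here is a minimal sketch, not MojeekBot's actual code. `ExampleBot` and the `PoliteScheduler` class are hypothetical names; the 4-second figure is the minimum quoted further down this thread. It uses Python's standard `urllib.robotparser` to honour robots.txt and tracks the last request time per host.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"   # hypothetical crawler name, for illustration only
MIN_DELAY_SECONDS = 4.0     # assumption: the minimum delay quoted in this thread


class PoliteScheduler:
    """Tracks robots.txt rules and the last request time for each host."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.rules = {}          # host -> RobotFileParser
        self.last_request = {}   # host -> time of the most recent request

    def load_robots(self, host, robots_txt):
        """Parse the text of a host's robots.txt (fetching it is out of scope)."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if rules are loaded for the host and they permit the URL."""
        parser = self.rules.get(urlsplit(url).netloc)
        # Design choice of this sketch: refuse to crawl a host with no rules loaded.
        return parser is not None and parser.can_fetch(USER_AGENT, url)

    def seconds_until_ready(self, url):
        """How long to wait before the next request to this URL's host."""
        last = self.last_request.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_request(self, url):
        """Call right after sending a request so the delay starts counting."""
        self.last_request[urlsplit(url).netloc] = self.clock()
```

A crawl loop would check `allowed(url)`, sleep for `seconds_until_ready(url)`, fetch, and then call `record_request(url)`.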

Generally a 'badly behaved bot' will ignore robots.txt or hit a site too hard with requests.

Our bot uses a specific user agent which you can verify via DNS. https://www.mojeek.com/bot.html
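A common way to do this kind of DNS check is a forward-confirmed reverse lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it returns the same IP. The sketch below assumes that approach; `verify_crawler_ip` is a hypothetical helper, and the domain suffix to pass in is whatever the bot page linked above documents (I have not hard-coded one).

```python
import socket


def verify_crawler_ip(ip, allowed_suffix,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a host under `allowed_suffix`
    and that host forward-resolves back to the same IP.

    Note: socket.gethostbyname_ex only handles IPv4 addresses.
    """
    try:
        hostname, _aliases, _addrs = reverse_lookup(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    hostname = hostname.rstrip(".").lower()
    suffix = allowed_suffix.lower()
    if not suffix.startswith("."):
        suffix = "." + suffix
    # Match on a full label boundary so "example.com.evil.test" is rejected.
    if not hostname.endswith(suffix):
        return False

    try:
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses
```

The forward step matters because anyone who controls the reverse DNS for their own IP range can make it claim any hostname; only the domain owner controls the forward record.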

> There is also a generous crawl delay between pages on the same host.

What's the order of magnitude of this delay? milliseconds? hundreds of milliseconds? seconds? I'm curious what's considered 'polite' in this realm and how the various parties come to form opinions on this.

A minimum of 4 seconds.

I just had a look, and there's a non-standard "Crawl-delay" directive, an extension to robots.txt that can be used to ask a spider to take some time between page visits:

  User-agent: bingbot
  Allow: /
  Crawl-delay: 10
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
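For a crawler that does want to read this directive, Python's standard `urllib.robotparser` exposes it through `crawl_delay()` (Python 3.6+). A minimal sketch using the example above; `crawl_delay_for` and its `default` parameter are hypothetical names for illustration:

```python
from urllib.robotparser import RobotFileParser

# The example rules from this comment.
ROBOTS_TXT = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, user_agent, default=None):
    """Return the Crawl-delay (in seconds) that applies to `user_agent`,
    or `default` if robots.txt does not set one for that agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return default if delay is None else delay
```

Here `crawl_delay_for(ROBOTS_TXT, "bingbot")` gives 10, while an agent with no matching group falls back to whatever `default` the crawler chooses.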
Hello, MojeekBot doesn't observe the crawl-delay directive, but thanks for the reminder of it: it's beneficial for us to know if site owners require more grace between requests.

Hey. Good job with Mojeek. It seems the crawl-delay directive is not part of the robots.txt standard. It probably should be, but that's not up to you!