Hacker News
by ColinHayhurst 2020 days ago
We are not aware that we have any problem like that.

This might explain why GigaBlast has a problem:

Because of bugs in the original Gigablast spidering code, the Findx crawler ended up on a blacklist in Project Honeypot as being “badly behaved” (fixed in our fork). That meant quite a bit of trouble for us because CDN providers, which are very powerful hubs for internet traffic, put a lot of weight on this blacklist. Some of the most popular websites and services on the internet run through services like Cloudflare and other CDNs – so if you are in bad standing with them, suddenly a large part of the internet is not available, and we weren’t able to index it.

extract from: https://web.archive.org/web/20190921180535/https://privacore...

3 comments

> fixed in our fork

Does this mean your spider is a fork of Gigablast? Is there some additional interesting technical information about how your code/infrastructure is set up?

I realise this is not addressing your second question but you might find it interesting. Post below on server expansion one year ago. We are adding another 100 servers over Christmas and early new year.

https://blog.mojeek.com/2019/12/100-server-build-and-install...

We'll be writing about our tech stack in part 3 of our 4-part FAQ series; this is part 1:

https://blog.mojeek.com/2020/11/frequently-asked-questions-a...

No, we have our own spider.
The post was from Findx, not Mojeek.
Any insight into what is considered a "badly behaved crawler"? Or is it something that you work out with the CDNs so they don't blacklist your IP?
Hello, I work on the technical side of Mojeek.

Mojeek follows the robots.txt protocol so if a site doesn't want to be crawled by MojeekBot we respect that wish. There is also a generous crawl delay between pages on the same host.

Generally a 'badly behaved bot' will ignore robots.txt or hit a site too hard with requests.
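
As a rough illustration of the robots.txt half of that, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content and the example.com URLs are made up for the example, and this is not Mojeek's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; a real crawler would fetch it
# from https://<host>/robots.txt before requesting any page on that host.
ROBOTS_TXT = """\
User-agent: MojeekBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A well-behaved bot skips any URL for which this returns False.
print(is_allowed(ROBOTS_TXT, "MojeekBot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "MojeekBot", "https://example.com/private/a.html"))
print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/index.html"))
```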

Our bot uses a specific user agent which you can verify via DNS. https://www.mojeek.com/bot.html
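
For site owners, one common way to do this kind of DNS check is forward-confirmed reverse DNS. The Python sketch below is an assumption about how such a check could look, not the procedure documented on the page above: the ".mojeek.com" suffix and the injectable lookup parameters are hypothetical, so check the bot page for the hostnames it actually lists.

```python
import socket


def verify_crawler_ip(ip, allowed_suffix, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end with allowed_suffix (include the leading
       dot, e.g. ".mojeek.com", so "evilmojeek.com" does not match).
    3. Forward-resolve the hostname and require it to map back to the IP.

    The lookup parameters exist only so the function can be exercised
    without network access; by default the socket module is used.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        hostname = reverse_lookup(ip)
        if not hostname.rstrip(".").lower().endswith(allowed_suffix):
            return False
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False
```

In a server log review you would call `verify_crawler_ip(client_ip, ".mojeek.com")` for requests whose user agent claims to be the bot, and treat a False result as a spoofed user agent.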

> There is also a generous crawl delay between pages on the same host.

What's the order of magnitude of this delay? milliseconds? hundreds of milliseconds? seconds? I'm curious what's considered 'polite' in this realm and how the various parties come to form opinions on this.

A minimum of 4 seconds.
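
To make that concrete, a per-host minimum delay could be enforced along these lines. This is only a sketch in Python under stated assumptions: the HostRateLimiter class and its clock parameter are hypothetical names, and it says nothing about how MojeekBot is actually implemented.

```python
import time
from urllib.parse import urlsplit

MIN_DELAY_SECONDS = 4.0  # the minimum quoted above


class HostRateLimiter:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.next_allowed = {}  # host -> earliest allowed time of next request

    def wait_time(self, url):
        """Return seconds to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.min_delay
        return start - now
```

A fetch loop would call `time.sleep(limiter.wait_time(url))` before each request, so pages on the same host are spaced at least four seconds apart while different hosts do not block each other.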
I just had a look and there's a non-standard "crawl-delay directive" extension to robots.txt that can be used to ask a spider to take some time between page visits:

  User-agent: bingbot
  Allow: /
  Crawl-delay: 10
https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...
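
For anyone curious how a spider that does honour the directive might read it, Python's standard urllib.robotparser exposes it via crawl_delay(). The sketch below is only an illustration: effective_delay and its floor_seconds default (borrowed from the 4 seconds mentioned above) are hypothetical, not any particular crawler's behaviour.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bingbot
Allow: /
Crawl-delay: 10
"""


def effective_delay(robots_txt, user_agent, floor_seconds=4.0):
    """Return the delay to use: the site's Crawl-delay, but never below the floor."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    requested = parser.crawl_delay(user_agent)  # None if no directive applies
    if requested is None:
        return floor_seconds
    return max(floor_seconds, float(requested))


print(effective_delay(ROBOTS_TXT, "bingbot"))
print(effective_delay(ROBOTS_TXT, "SomeOtherBot"))
```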
Hello, MojeekBot doesn't observe the crawl-delay directive, but thanks for the reminder. It's beneficial for us to know if site owners require more grace between requests.
Thanks for that link, I haven't come across it before.