Hacker News (HN Mirror)

by knuckleheads 2008 days ago

> Do any site operators actually block non-Google search engine crawlers because being listed DDG/Bing/etc isn't worth the extra cost of serving the crawler?

Many website operators do block crawlers from non-Google search engines, because the cost of being crawled isn't worth it to them. Here's a good quote from one such webmaster:

    As a webmaster I get a bit tired of constantly having to deal with the startup crawler du jour.

    From law firms looking for DMCA violations to verticals search engines, to image aggregators, to company intelligence resellers… It feels to me that everybody and their brother has gotten into spidering sites.

    With 10,000s of pages that have content that is only relevant to a targeted audience who is perfectly able to find us on the majors, I do not hesitate to block (and possibly ban) when I see an aggressive crawler that does not provide me or my customers with direct benefits.

Taken from http://www.skrenta.com/2008/04/cuill_is_banned_on_10000_site...
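The quote doesn't say how that webmaster blocks crawlers, so the following is only a sketch of one common mechanism: a robots.txt that admits Googlebot and turns everyone else away, checked with Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `example.com` URLs, and the helper names are all made up for illustration. Keep in mind robots.txt is just a request that polite crawlers honor; actually banning an aggressive one takes server-side blocking.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption, not taken from any real site):
# Googlebot may fetch everything, every other crawler is asked to stay out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "bingbot", "DuckDuckBot"):
        allowed = may_crawl(rules, agent, "https://example.com/some/page")
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

A new search engine that obeys these rules can't index the site at all, which is the cost being discussed here.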

> Perhaps other search engines should spoof GoogleBot. Browsers have being doing that since forever spoofing Netscape (Mozilla), Safari, etc. for the same reason.

People have tried this and it doesn't work. Google documents ways to verify that traffic really comes from Google IP addresses, and both practitioners and academics study how to spot fake Googlebots. https://developers.google.com/search/docs/advanced/crawling/... https://blogs.akamai.com/2014/07/search-engine-impersonation... https://ieeexplore.ieee.org/document/8421894
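To make that concrete, here's a rough sketch of the DNS-based check as I understand Google's published guidance: reverse-resolve the client IP, make sure the hostname sits under a Google crawler domain, then forward-resolve that hostname and confirm it maps back to the same IP. The suffix list and function names are my own assumptions, and `socket.gethostbyname_ex` only covers IPv4, so don't treat this as a complete implementation.

```python
import socket
from typing import Callable, List

# Assumed suffixes for Google crawler hostnames; check Google's current docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> List[str]:
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """Return True only if `ip` passes both the reverse and forward DNS checks."""
    try:
        hostname = reverse(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    # Step 1: the reverse-resolved name must be under a Google crawler domain.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    # Step 2: that name must resolve back to the original IP. A spoofer can
    # set any PTR record on their own IP but cannot change Google's A records.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The resolvers are passed in as parameters so the logic can be exercised without touching the network. The point is that a spoofed User-Agent string does nothing against this check, since it only looks at where the connection actually came from.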

> This sounds like a common fallacy of people criticizing the free market.

I am asserting that crawling the web is a natural monopoly. This means that the free market has failed and that it is not possible for the market to heal itself in this regard. There is significant evidence that this is the case and I imagine you'll be hearing more and more about it soon.