| HN Mirror


	by aleph_naught 2045 days ago
	> since no one other than Google is really allowed to crawl the web. ??

1 comments

knuckleheads 2045 days ago

There are two main reasons why I say nobody besides Google is really allowed to crawl the web.

The first is that Google gets much more access to pages on websites than everybody else. You can see this by examining the robots.txt files of various websites[0]. I've been doing this for several years now, and Google has a consistent advantage across the many thousands of websites that I've looked at. This adds up to a significant advantage, and many search engine operators complain about how it hampers their ability to compete with Google[1].
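One way to check this kind of per-crawler difference is to parse a robots.txt file and ask what each user agent may fetch. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example.com` URL, the `ExampleBot` name, and the `allowed_agents` helper are all made up for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration: Googlebot may fetch
# everything, while every other crawler is shut out.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed_agents(robots_txt, url, agents):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    SAMPLE_ROBOTS_TXT,
    "https://example.com/articles/1",
    ["Googlebot", "ExampleBot"],  # ExampleBot is a hypothetical upstart crawler
)
print(result)  # {'Googlebot': True, 'ExampleBot': False}
```

Running the same check over the robots.txt files of many sites would show how often the rules for Googlebot differ from the rules for everyone else.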

The second is that Google gets to ignore the crawl-delay directive in robots.txt, while other search engines don't[2]. Website operators cannot tell Google how fast they want their website crawled; they can only request that Google slow down. If another search engine tried to do what Google does, it would likely be blocked by many important websites.
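For context, this is roughly what honoring the directive looks like for a crawler that is not Google. It is a sketch only: the robots.txt content, the `ExampleBot` name, the `polite_delay` helper, and the one-second fallback are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to wait 10 seconds
# between requests.
ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def polite_delay(robots_txt, agent, fallback=1.0):
    """Seconds to wait between requests for this agent.

    Uses the site's Crawl-delay if one applies to the agent, otherwise
    an arbitrary fallback (an assumption, not a standard value).
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else fallback


print(polite_delay(ROBOTS_WITH_DELAY, "ExampleBot"))  # 10.0
```

A crawler would sleep for that many seconds between requests to the same host, which caps it at a few thousand pages per day per site at a 10-second delay.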

If you would like to read more about this, please check out https://knuckleheads.club/

[0] https://pdf.sciencedirectassets.com/robots.txt

[1] https://www.nytimes.com/2020/12/14/technology/how-google-dom...

[2] https://www.seroundtable.com/google-noindex-in-robots-txt-de...

grishka 2045 days ago

So, uh, don't respect robots.txt in your search engine? It's not like there's a law that you have to, and that you can't pretend you are Googlebot. The only real obstacle I can imagine is that some firewalls might be configured to be more permissive with traffic originating from Google subnets.

knuckleheads 2045 days ago

If you straight up ignored robots.txt files, you would be blocked fairly quickly by many website operators and would no longer be able to access those websites. You might even be served cease-and-desist letters by some websites, and sued if you persist and try to find ways around the blocks.

grishka 2045 days ago

And what if you do respect it but follow Googlebot rules?

knuckleheads 2045 days ago

Applebot was able to get away with doing exactly this, but I imagine that's because it's Apple and websites knew that Apple was about to send them enough traffic via Apple News to make it worth their while. I don't know if other search engine operators have tried this, but I would imagine they would get caught by rate limiters set for non-Google IPs and then be blocked.

grishka 2045 days ago

Still, you keep saying all that as if most websites even notice that they're being crawled, and as if their operators are aware of exactly when and by whom they're crawled. As if the admin gets a notification, with precise details, every time a crawler comes by. I don't think it's nearly as serious as you're trying to make it look.

oh_sigh 2045 days ago

Google tells website operators how to verify Googlebot in a way that can't be spoofed.
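The verification in question is usually described as a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. Below is a minimal sketch of that check, assuming the hostname suffixes `.googlebot.com` and `.google.com`; the function name, the suffix list, and the injectable lookup parameters are illustrative choices, and Google's own documentation is the authority on which domains and IP ranges count.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers; check Google's
# documentation for the current list before relying on this.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check for a client claiming to be Googlebot.

    The lookup functions are injectable so the logic can be exercised
    without network access. The defaults use the system resolver, and the
    default forward lookup only returns IPv4 addresses.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)  # step 1: IP -> hostname (PTR record)
    except OSError:
        return False

    # Step 2: the hostname must sit under a Google-controlled domain.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # Step 3: the hostname must resolve back to the same IP. This is
        # what defeats a PTR record set by whoever controls the IP block.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A user agent string alone proves nothing, since any client can send `Googlebot` in its headers; the forward lookup is the part an impostor cannot fake without controlling Google's DNS zone.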

ehnto 2045 days ago

It's a situation where the rules seem obvious, but the practical realities mean Google has the advantage by being the incumbent. No business that relies on search traffic would dare block Google, but some upstart search engine will quickly end up on blacklists even with reasonably slow crawling.
