
What I was trying to say is that if, for example, you were building a search engine to compete with Google, but using Google's name to build that service, you'd be in trouble.

How would Google know? They could start by setting up fictitious websites that appear to be unaffiliated with them. If your crawler were to hit one of those sites, you would reveal yourself. I wouldn't be at all surprised if Google had this kind of "honeypot" sitting out there watching for web crawlers (rogue or otherwise).

Google likely also has business partner relationships with big content producers. I'm sure those partners can report back on how they are being crawled, so that Google can confirm it is finding all the content the site owners want it to find.

As an aside, I used to run such a honeypot website. Web crawler behavior is fascinating. I loved being able to detect and classify the various kinds of web crawlers: some followed robots.txt, some didn't, and some went directly to robots.txt and then scraped exactly the pages that were meant to be excluded. I wish I had kept the project going and formalized the results.
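
To make the three crawler categories above concrete, here is a minimal Python sketch of how that kind of classification might work. It is not the code from the honeypot project, which isn't shown here. The function name `classify_crawlers`, the label strings, and the input shape (a chronological list of `(client, path)` pairs plus a list of disallowed path prefixes) are all assumptions made for illustration.

```python
ROBOTS_PATH = "/robots.txt"


def classify_crawlers(hits, disallowed_prefixes):
    """Label each client by how it treated robots.txt.

    hits: iterable of (client, path) pairs in chronological order
          (hypothetical input shape; a real access log needs parsing first).
    disallowed_prefixes: path prefixes that robots.txt excludes.
    """
    prefixes = tuple(disallowed_prefixes)
    state = {}
    for client, path in hits:
        s = state.setdefault(
            client,
            {"robots": False, "violated": False, "violated_after_robots": False},
        )
        if path == ROBOTS_PATH:
            s["robots"] = True
        elif path.startswith(prefixes):
            s["violated"] = True
            # Hitting an excluded page after reading the rules is the
            # most suspicious pattern, so it is tracked separately.
            if s["robots"]:
                s["violated_after_robots"] = True

    labels = {}
    for client, s in state.items():
        if s["violated_after_robots"]:
            labels[client] = "read-then-scraped-excluded"
        elif s["violated"]:
            labels[client] = "ignored-robots"
        elif s["robots"]:
            labels[client] = "followed-robots"
        else:
            # Never fetched robots.txt and never touched an excluded page.
            labels[client] = "unknown"
    return labels


sample_hits = [
    ("a", "/robots.txt"), ("a", "/index.html"),
    ("b", "/private/x"),
    ("c", "/robots.txt"), ("c", "/private/y"),
    ("d", "/index.html"),
]
labels = classify_crawlers(sample_hits, ["/private/"])
```

A real honeypot would probably key clients on more than a single identifier (IP address, user agent, request timing) and would read the excluded paths from the actual robots.txt file, but the three-way split is the same idea.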