Hacker News
by zackify 3339 days ago
How would you get a letter from Google if you are never scraping Google's sites? They would never know?
2 comments

I was trying to say that if, for example, you were building a search engine to compete with Google, but using Google's name to build that service, you'd be in trouble.

How would Google know? They would start by setting up fictitious websites that appear to be unaffiliated with them. If your crawler were to hit one of those sites, you would reveal yourself. I wouldn't be at all surprised if Google had this kind of "honey pot" sitting out there watching for web crawlers (rogue or otherwise).

Google likely also has business partner relationships with big content producers, and I'm sure it can get reports back from them about its crawling, to ensure that Google is correctly finding all the content the site owners want it to find.

As an aside, I used to run such a honeypot website. Web crawler behavior is fascinating. I loved being able to detect and classify the various kinds of web crawlers: some followed robots.txt, some didn't, and some went directly to robots.txt and then scraped the pages that were meant to be excluded. I wish I had kept the project going and formalized the results.
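
A minimal sketch of how that kind of classification could look, not the code from the original honeypot. It assumes the access log has already been reduced to an ordered list of (client, path) pairs; the disallowed prefixes, the label names, and the `classify_crawlers` function are all hypothetical and only there for illustration.

```python
# Hypothetical disallowed prefixes, standing in for the honeypot's robots.txt rules.
DISALLOWED_PREFIXES = ("/private/", "/trap/")


def classify_crawlers(requests, disallowed=DISALLOWED_PREFIXES):
    """Label each client from an ordered list of (client, path) requests."""
    seen = {}               # insertion-ordered set of clients
    fetched_robots = set()  # clients that have requested /robots.txt so far
    violated = {}           # client -> True if a violation came after reading robots.txt

    for client, path in requests:
        seen[client] = True
        if path == "/robots.txt":
            fetched_robots.add(client)
        elif path.startswith(disallowed):
            # Record whether the client already knew the rules when it hit an excluded page.
            violated[client] = violated.get(client, False) or client in fetched_robots

    labels = {}
    for client in seen:
        if client not in violated:
            # Never touched an excluded page.
            labels[client] = "compliant" if client in fetched_robots else "no-robots-fetch"
        elif violated[client]:
            # Read robots.txt, then requested what it excludes.
            labels[client] = "read-rules-then-scraped-excluded"
        else:
            # Hit excluded pages without having read robots.txt first.
            labels[client] = "ignored-robots"
    return labels


sample_log = [
    ("a", "/robots.txt"), ("a", "/index.html"),
    ("b", "/private/x"),
    ("c", "/robots.txt"), ("c", "/trap/page"),
    ("d", "/index.html"),
]
labels = classify_crawlers(sample_log)
```

A real log would need the client keyed on something sturdier than a single field (IP plus user-agent, say), since rogue crawlers rotate both.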

Even if they do know, what do they have to do with it? Does Google have an exclusive legal claim to its user-agent string?
If the name "googlebot" is trademarked, then yes, they would have a basis for a claim. At the very least, it would be leverage they could use if they believed you were causing them harm in some way.