by appleflaxen 2587 days ago
that's an awesome idea!

how do you determine which IPs are the google webcrawler, though?


I check whether the user agent string contains "Google". It seems that Google Chrome only includes "Chrome", so I don't block regular users this way. Here is an overview of all Google crawlers' user agent strings: https://support.google.com/webmasters/answer/1061943 As you can see, they all include "Google" (capitalized).

I don't use robots.txt because Google says it doesn't stop them from including the site in search results: https://support.google.com/webmasters/answer/6062608 I don't know whether returning an HTTP 403 error will, but it seems worth a try.
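For illustration, the user agent check and 403 response described above could look roughly like this as a plain WSGI app. This is only a sketch, not the actual server config from this comment: the names `is_google_crawler` and `app` and the response bodies are made up for the example.

```python
def is_google_crawler(user_agent):
    """Case-sensitive substring check: crawler strings contain "Google",
    while regular Chrome only contains "Chrome"."""
    return "Google" in (user_agent or "")


def app(environ, start_response):
    """Hypothetical WSGI app: 403 for Google user agents, 200 otherwise."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if is_google_crawler(user_agent):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]
```

Any WSGI server can host `app`, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library. Note that the check trusts whatever user agent the client sends.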

I also looked into banning IP ranges (that would have been my preferred option), but if I remember correctly they were subject to change and it seems overkill to write a scraper for that page that would then have to generate a config file and reload a service.

The documented way is the noindex tag (in html or http headers): https://support.google.com/webmasters/answer/93710?hl=en
Not all resources are HTML so I couldn't use the meta tag, but the header looks interesting! Reading up on it, it seems to achieve pretty much the same thing as my current solution. Would you say this is better for some reason? Nobody should encounter my server's 403 response except those with a Google user agent anyway.

The page doesn't say whether this works the same as the robots.txt disallow, where you may still appear in results because other pages link to you. The 403 might be more effective, but I can't really tell either way.
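For comparison, the header variant of noindex discussed above could be sketched as a small WSGI middleware that adds `X-Robots-Tag: noindex` to every response, including non-HTML resources. The wrapper name `add_noindex` is hypothetical, and this is just one way to set the header; a web server config can do the same thing.

```python
def add_noindex(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex."""
    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Append the header without mutating the caller's list.
            new_headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, new_headers, exc_info)
        return app(environ, start_with_header)
    return wrapped
```

Unlike the 403 approach, the crawler still receives the content here and is only asked not to index it.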

great point, but at that point you have to trust google with your data.

and if you are taking this action in protest, you probably don't.

if you don't serve the data at all, the outcome is imposed on google instead of depending on whether it respects your "noindex", which is only a suggestion (like "do not track" in the chrome browser; we all know how that turned out)

No need: you can block them by user agent, since the strings are consistent, or just disallow them in robots.txt.

If you need to make sure it's actually a Google bot when a client shows up with that user agent, you can use reverse DNS: https://support.google.com/webmasters/answer/80553
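A rough sketch of that reverse DNS check, as I understand it: look up the hostname for the IP, check that it ends in a Google crawler domain, then resolve the hostname forward and confirm it maps back to the same IP. The function name, the injectable resolver parameters, and the suffix list (`.googlebot.com`, `.google.com`) are assumptions for this example; check the linked page for the current domains.

```python
import socket

# Assumed crawler domains; verify against Google's documentation.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse lookup, domain check, then forward-confirm the same IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the original address.
    return ip in addresses
```

The default forward resolver, `socket.gethostbyname_ex`, only handles IPv4, so IPv6 clients would need a different lookup. Since each check costs two DNS queries, caching the result per IP is probably worthwhile.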