by lucb1e 2588 days ago
Going one step further, I removed my websites from the Google index: https://lucb1e.com/?p=post&id=130

If you want to find my stuff, you now have to use DDG or some other search engine. Hopefully we can signal to Google that we are not okay with monopolistic behaviour (that's why I blocked them: not merely for being the biggest or a monopolist, but for also behaving like one; see the blog post for details).

I mean, that's great and all. But I actually want people to use my website.
So do I, but my livelihood does not depend on it, so I can safely do this and encourage people to use another search engine. Any other search engine. Even if you fall back to Google when you can't find something, at least use another search engine by default. I don't think that's too much to ask of a tech audience (which is the audience of my websites).

Of course, if your websites generate income from people who come from Google, I would not expect you to follow suit. As I mentioned in the post, the idea is that there are lots of resources on the web that are not there to turn a profit, but that are still valuable to people. Google is not listening when we use words, so I took action.

But... how does this encourage anyone to do anything? No one knows who you are or what your site is, and I'm sure Google couldn't care less. This sounds like the technical equivalent of someone refusing to drive on public roads because taxation is theft or something.
The "google couldn't care less" part I address in the post as well: I'm well aware that my site by itself will never have any significant impact.

As for how it encourages anyone to do anything: one site is not enough, but if a few people do this, the word starts to spread that you might need to try another search engine to find the more obscure things on the internet. People here seem to like the idea (judging by its positioning in the thread), and a lot of the tech community reads this. A few minutes ago, someone reached out via chat because they recognized my username. Those are also the people who make decisions at Google, or friends of the people who make those decisions.

I understand where you're coming from, but I think it's more like voting than like your taxation comparison: your vote never matters, why bother? You don't have a voice in the government. None. But still, people vote. Collectively, we can make a difference.

Doesn't this ultimately make the situation worse, since now fewer people will see that page in the first place?

i.e. the only people that will (likely) see it are the ones that are already doing what you want.

Would someone who wants to do this use Google? Even if you google for how to block google from your website, you are likely looking to block Google from pages like phpmyadmin, not all of their website. I don't think this blog post would ever be a relevant search result in Google. Maybe if someone saw this comment, tells a friend, and $friend googles it... It just seems like a remote chance.
From what I recall, Bing also displays AMP pages. Wouldn't you want to block crawls from Bing too?
I didn't know that, though Bing is not in enough of a position of power to abuse it. If they suddenly take over the vast majority of search queries from Google, this might become relevant, but that seems unlikely. For now, I'm happy if some competition is reintroduced in the search engine 'market'.
that's an awesome idea!

how do you determine which IPs are the google webcrawler, though?

I match the user agent string on containing "Google". It seems that Google Chrome only includes "Chrome", so I don't block users this way. Here is an overview of all Google crawlers' user agent strings: https://support.google.com/webmasters/answer/1061943 As you can see, they all include Google (capitalized).

I don't use robots.txt because they say that doesn't stop them from including the site in search results: https://support.google.com/webmasters/answer/6062608 I don't know if returning an HTTP 403 error will, but it seems like it's worth a try.
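
For anyone who wants to try the same thing, here is a rough sketch of the check described above, written in Python for illustration. The function names are hypothetical and this is not the actual server config, only the two rules from this comment: a case-sensitive match on "Google" in the user agent, and a 403 for anything that matches.

```python
from typing import Optional

# The only rule taken from the comment: the user agent contains "Google"
# (capitalized). Everything else here is an illustrative assumption.
GOOGLE_MARKER = "Google"


def is_google_crawler(user_agent: Optional[str]) -> bool:
    """Case-sensitive substring check on the User-Agent header value."""
    return GOOGLE_MARKER in (user_agent or "")


def status_for(user_agent: Optional[str]) -> int:
    """Return the HTTP status to serve: 403 for Google crawlers, else 200."""
    return 403 if is_google_crawler(user_agent) else 200
```

A Chrome user agent only contains "Chrome", so `status_for` leaves regular visitors alone, while strings such as "Googlebot/2.1" or "AdsBot-Google" get the 403. Since the user agent is client-supplied, this only identifies clients that announce themselves as Google.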

I also looked into banning IP ranges (that would have been my preferred option), but if I remember correctly they were subject to change and it seems overkill to write a scraper for that page that would then have to generate a config file and reload a service.

The documented way is the noindex tag (in html or http headers): https://support.google.com/webmasters/answer/93710?hl=en
Not all resources are HTML so I couldn't use the meta tag, but the header looks interesting! Reading up on it, it seems to achieve pretty much the same thing as my current solution. Would you say this is better for some reason? Nobody should encounter my server's 403 response except those with a Google user agent anyway.

The page doesn't say whether this works the same as the robots.txt disallow, where you may still appear in results because other pages link to you. The 403 might be more effective, but I can't really tell either way.
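
For comparison, a minimal sketch of the header approach as a WSGI wrapper in Python. `add_noindex` and `demo_app` are made-up names for illustration; the only thing taken from the linked page is the `X-Robots-Tag: noindex` header itself. Note that the crawler has to be allowed to fetch the response to see the header, so this is an alternative to the 403 rather than something to stack on top of it.

```python
def add_noindex(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            kept = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            kept.append(("X-Robots-Tag", "noindex"))
            return start_response(status, kept, exc_info)

        return app(environ, patched_start)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical non-HTML resource, where a meta tag is not an option."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-"]
```

Because the header travels with the HTTP response, it also covers PDFs, images and other resources that have no place for a meta tag.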

great point, but at that point you have to trust google with your data.

and if you are taking this action in protest, you probably don't.

if you don't serve the data at all, whether or not to respect your "noindex" is imposed on google, rather than being a suggestion (like "do not track" in the chrome browser; we all know how that turned out)

No need, you can block them by user agent, they are consistent - or just disallow them in robots.txt.

If you need to make sure it's actually a google bot when a client shows up with the user agent, you can use reverse dns: https://support.google.com/webmasters/answer/80553
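
A rough sketch of that reverse DNS check in Python, for illustration only. The function name and the injectable resolver parameters are my own; the hostname suffixes are an assumption based on the linked page and should be checked against it before relying on this.

```python
import socket

# Assumption: Google's crawlers reverse-resolve to these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse=socket.gethostbyaddr,
                          forward=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it.

    The resolvers are parameters so the logic can be exercised without
    network access. gethostbyname_ex only handles IPv4.
    """
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False

    # The forward lookup must point back at the IP that connected.
    return ip in addresses
```

The forward lookup is what makes this hard to spoof: anyone can set a PTR record claiming to be googlebot.com for their own IP, but they cannot make that hostname resolve back to it.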