Hacker News new | ask | show | jobs
by jefftk 2501 days ago
robots.txt is a tool to control crawling, not to specify how you would like your site to be displayed (or not) in search results. If you don't want search engines to include your site, set:

    <meta name="robots" content="noindex">
while to block just Google do:

    <meta name="googlebot" content="noindex">
See https://support.google.com/webmasters/answer/93710

If Googlebot is not respecting robots.txt, and is crawling something it's been instructed not to crawl, let me know and I can file a bug?

(Disclosure: I work for Google but not on Search, speaking only for myself)

2 comments

But that requires that Googlebot be allowed to crawl the page in robots.txt in the first place.

How do you tell Googlebot to not crawl your site and to not index it either?

Previously, one could use the undocumented "Noindex" directive in robots.txt, but this will be disabled soon: https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...

The bot doesn't need to crawl your site for it to be indexed; it crawls other sites that link to yours.

You can specify your index preferences in Webmaster Tools. Don't know if there's a domain-wide off switch in there, but there probably is.

Using Webmaster Tools is not a good option since it requires you register with the exact company you are probably trying to not interact with.
The blog post you link has a bunch of alternatives, but I agree they're not great. If there are a lot of webmasters who want to be able to noindex through robots.txt then making the case for adding noindex to the standard would be a good next step.

(Still speaking only for myself)

Googlebot actually used to support a noindex rule in robots.txt, but they are removing it.

https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...

Yes, that was linked above. It looks like this is part of reducing support to what's in the spec?
Oops, yep. I didn't see that context.
I sent you an email, and I'm posting it here but without identifying info:

---

Hi Jeff,

Thank you for your comment, I'm replying via email to send some info I'd rather not share on HN, but will post the same redacted in HN. I used to (back when starting my web-dev career) run a one man show development team of a web agency and all our development/pre-prod sites (that had to be unauthed) had robots.txt to disallow all bots, but they still popped up in Google. Searching some of the old domains in google I found an example here: http://***.***/***, and attached is an example of it showing up in a SERP and a what the robots.txt looks like (and I'm pretty sure that the robots.txt has looked like that since that page was created).

In this case it is just one page that nobody will care about, and since I'm not working on projects that are open but "robots.txt hidden" anymore I don't know if it is as bad as it used to be, but I regularly see pages with the "No information is available for this page" whose domains have robots.txt's that disallow all bots but still show up in Google.

Please let me know if I missed anything :)