Hacker News
by tszming 4981 days ago
Common misinterpretation of how Google handles `Disallow` in robots.txt

Q. If I block Google from crawling a page using a robots.txt disallow directive, will it disappear from search results? [1]

robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.

[1] https://developers.google.com/webmasters/control-crawl-index...
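
To make the quoted advice concrete, here is a minimal sketch of a page that stays crawlable but asks not to be indexed, using both the robots meta tag and the X-Robots-Tag header. The WSGI app, the `/login` path, the page text, and the `call_app` helper are illustrative assumptions, not anything from Google's documentation.

```python
from wsgiref.util import setup_testing_defaults

# Robots meta tag form of the directive (goes in the HTML <head>).
NOINDEX_META = '<meta name="robots" content="noindex">'


def app(environ, start_response):
    """Serve a crawlable page that asks search engines not to index it."""
    body = (
        "<!doctype html><html><head>"
        + NOINDEX_META
        + "<title>Login</title></head><body>Please sign in.</body></html>"
    ).encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header form of the same directive; also works for non-HTML files.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(path="/login"):
    """Hypothetical helper: invoke the app once without opening a socket."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Either signal alone should be enough; the sketch sends both only to show the two forms side by side. Neither is seen if robots.txt disallows the URL, since the crawler never fetches the response.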

5 comments

We develop and host a bunch of extranets which, without a login, consist of your typical authentication page. We put a robots.txt file there, and the only sites that link to them are our customers' company home pages.

Google still indexes them. The definition of "relevant" here defies my wildest imagination.

robots.txt is not about indexing. It's about crawling.
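
A small sketch of that distinction, using Python's standard `urllib.robotparser`: a `Disallow` rule only answers the question "may this URL be fetched?", and says nothing about whether the URL may be listed. The robots.txt contents, the `may_crawl` helper, and the example hostnames are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an extranet login page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""


def may_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Fetching /login is refused, but nothing here expresses "do not index":
# a search engine can still list the bare URL if other pages link to it.
print(may_crawl("Googlebot", "https://extranet.example.com/login"))
print(may_crawl("Googlebot", "https://extranet.example.com/"))
```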

"In this case, you should not disallow the page in robots.txt"

But don't worry, we'll ignore the information in robots.txt anyway, so maybe it's better to have both pieces of information there.

And maybe if it's relevant they'll ignore the X-Robots-Tag as well.

"Common misinterpretation of how Google handles `Disallow` in robots.txt"

Here is why I think this happened: http://www.facebook.com/humans.txt

;)

Here's a video that explains how and why we handle robots.txt that way: http://www.mattcutts.com/blog/robots-txt-remove-url/

You can do access control on the contents of HTTP_REFERER: if a browser reaches a page disallowed in your robots.txt by following a Google link, serve it a 403 Forbidden. (In Apache 2.4, this can all be done using mod_authz_core.)

You could maybe say in your 403 forbidden message that Google has been forbidden from indexing the page (use ErrorDocument). If enough sites did that, Google might change their policy.
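
The idea can be sketched outside Apache as well. The following is a rough Python illustration, not the mod_authz_core configuration the comment refers to; the function names are hypothetical, and matching only `google.com` and its subdomains is a simplifying assumption (country-specific Google domains are not covered).

```python
from urllib.parse import urlsplit


def came_from_google(referer):
    """Return True if the Referer value points at google.com or a subdomain."""
    if not referer:
        # Header missing or empty: nothing to match against.
        return False
    host = (urlsplit(referer).hostname or "").lower()
    return host == "google.com" or host.endswith(".google.com")


def status_for(referer):
    """Pick the response status for a page that robots.txt disallows."""
    return "403 Forbidden" if came_from_google(referer) else "200 OK"


print(status_for("https://www.google.com/search?q=extranet"))
print(status_for(None))
```

The check only sees whatever value the client chooses to send, so a request with no Referer passes straight through.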

Google's default for logged-in users is to use https and strip the search phrase when leaving the SERP, so HTTP_REFERER will be empty. A lot of security software also strips HTTP_REFERER. Being behind a proxy may cause it to be empty, too. In general, I don't think you can rely on headers sent by the browser. You don't know if they are real or forged.