Hacker News

by tszming 4981 days ago

Common misinterpretation of how Google handles `Disallow` in robots.txt

Q. If I block Google from crawling a page using a robots.txt disallow directive, will it disappear from search results? [1]

robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant. If you wish to explicitly block a page from being indexed, you should instead use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, you should not disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed.

[1] https://developers.google.com/webmasters/control-crawl-index...
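
The interaction described in the answer above can be sketched in a few lines of Python. This is only an illustration, not Google's actual logic: the function name `noindex_is_effective`, the example URLs, and the simplified header parsing are all assumptions. It uses the standard library's `urllib.robotparser` to check whether a crawler may fetch the URL, and treats a `noindex` in `X-Robots-Tag` as effective only if the page is crawlable.

```python
from urllib.robotparser import RobotFileParser


def noindex_is_effective(robots_txt, url, headers, user_agent="Googlebot"):
    """Return True only if a crawler could actually see the noindex directive.

    Hypothetical helper: a noindex header cannot be obeyed if robots.txt
    stops the crawler from fetching the page in the first place.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    # Simplified parsing: treat X-Robots-Tag as a comma-separated directive list.
    tag = headers.get("X-Robots-Tag", "")
    directives = [part.strip().lower() for part in tag.split(",")]
    has_noindex = "noindex" in directives

    return crawlable and has_noindex


# The page is disallowed, so the noindex header is never seen.
blocked = "User-agent: *\nDisallow: /private/"
# The page is crawlable, so the noindex header can be read and obeyed.
open_rules = "User-agent: *\nDisallow: /other/"

page = "https://example.com/private/page.html"
headers = {"X-Robots-Tag": "noindex, nofollow"}

print(noindex_is_effective(blocked, page, headers))
print(noindex_is_effective(open_rules, page, headers))
```

Under these assumptions the first call reports that the directive has no effect, which is the situation the answer warns about: blocking the crawl also hides the tag.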

5 comments

lloeki 4981 days ago

We develop and host a bunch of extranets, which without login consist of your typical authentication page. We put a robots.txt file there, and the only sites that link there are our customers' company home sites.

Google still indexes them. The definition of "relevant" here defies my wildest imagination.

kevinpet 4981 days ago

robots.txt is not about indexing. It's about crawling.

raverbashing 4981 days ago

"In this case, you should not disallow the page in robots.txt"

But don't worry, we'll ignore the information in robots.txt anyway, so maybe it's better to have both pieces of information there.

And maybe if it's relevant they'll ignore the X-Robots-Tag as well.

tep 4981 days ago

"Common misinterpretation on how Google handle `Disallow` in robots.txt"

Here is why I think this happened: http://www.facebook.com/humans.txt

;)

Matt_Cutts 4981 days ago

Here's a video that explains how and why we handle robots.txt that way: http://www.mattcutts.com/blog/robots-txt-remove-url/

chalst 4981 days ago

You can do access control on the contents of HTTP_REFERER: if the browser reaches a page disallowed in your robots.txt by following a Google link, serve it a 403 Forbidden. (In Apache 2.4, this can all be done using mod_authz_core.)

You could maybe say in your 403 forbidden message that Google has been forbidden from indexing the page (use ErrorDocument). If enough sites did that, Google might change their policy.
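
A rough Python sketch of that decision follows, as a stand-in for the Apache configuration rather than a replacement for it. The function name `status_for_request`, the example paths, and the check for `google.com` hosts only (ignoring country-specific domains) are assumptions made for illustration.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def status_for_request(path, referer, robots_txt):
    """Return 403 for a disallowed path reached via a Google link, else 200.

    Hypothetical helper: `referer` is the raw Referer header value, or None
    when the client sent no such header.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    disallowed = not parser.can_fetch("*", path)

    # Assumption: only google.com and its subdomains count as "Google".
    host = (urlsplit(referer).hostname or "") if referer else ""
    from_google = host == "google.com" or host.endswith(".google.com")

    return 403 if disallowed and from_google else 200


robots = "User-agent: *\nDisallow: /private/"

print(status_for_request("/private/page.html", "https://www.google.com/search?q=x", robots))
print(status_for_request("/private/page.html", None, robots))
```

The second call shows the limit of the approach: a request with no Referer header passes straight through, so the check only works for clients that send the header unmodified.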

vitalique 4981 days ago

Google's default for logged-in users is to use https and strip the searched phrases when leaving the SERP, so HTTP_REFERER will be empty. A lot of security software also strips HTTP_REFERER. Being behind a proxy may cause it to be empty, too. In general, I don't think you can rely on headers sent by the browser. You don't know if they are real or forged.
