Hacker News new | ask | show | jobs
by franze 4976 days ago
an old meme, and my usual recommendation: just test it: create a page that i not linked from anywhere. visit it with the browsers mentioned above. watch the logfiles. wait for it. nope, no googlebot request. it is unbelievable easy to test, i have done so on various occasions in the past, so there is no need for you to spread a "several people have reported" rumor. just ... test ... it.

as for the old stories, that google does this kind of thing: people, especially SEOs or people who think they know SEO, always blame google. oh, my beta.site has been indexed, it must be because of ... google is evil.

most of the times i have seen cases where googlebot found a not published yet site it was because of (just some examples, not a complete list) i.e.:

* turned on error reporting (most of the PHP sites) * the URLs were already used in some javascript * server side analytics software, open to the public * apaches shows file/order structure * indexable logfiles * people linked to the site * somebody tweeted about it * site was covered on techcrunch (yes, really) * all visited URLs in the network were tracked by a firewall, the firewall published a log on an internal server, the internal server was reachable from the outside * internal wiki is indexable * intranet is indexable * concept paper is indexable

testing your hypothesis "chrome/google toolbar/... push URLs into the googlebot discovery queue, which leads to googlebot visits" is easily testable. no need to spread rumors. setup for testing this: make an html-page (30 seconds max, basically ssh to your server, create a file, write some html), tail & grep logfiles (30 sec max), wait (forever)

2 comments

It is a myth that is hard to get rid of. No one wants to admit they tweeted out a link to the dev website.

Though I recently found this on the Google+ FAQ: http://support.google.com/webmasters/bin/answer.py?hl=en&...

  When you add the +1 button to a page, Google assumes that
  you want that page to be publicly available and visible in
  Google Search results. As a result, we may fetch and show
  that page even if it is disallowed in robots.txt.
I can understand adding a +1 button to a dev site, and then not understanding why it shows up in the index.
Don't forget people who may have * installed UserScripts / GreaseMonkey scripts * Browser plugins other than Google Toolbar which may send stuff to the big G * (Self-)modded browsers which send out stuff to wherever...the list goes on and on indeed.

Best thing to do to keep a site secret: * Don't host it on the internet (d'uh) * Hide behind a portal page and have that and your server weed out misconfigured / hijacked browsers before any can proceed to your real secret site (also see web cloaking).