	by oozcitak 5512 days ago
	> How does anyone possibly allow a data dump to come anywhere near somewhere Google could index it?

	Let me just put this sql dump in the web root for a couple hours to copy over to the test server.

bad_user 5512 days ago

Before it can index the file, Google also has to know that it is there: from a link available to Google, from the website's sitemap, from directory listing being enabled in Apache, or some other shit like that.

maratd 5512 days ago

This is very simple. If you're using the Chrome browser, ChromeOS, or Google Toolbar, then Google is using their PageRank tech ... or essentially sending the URL you type into the browser to their servers for ranking purposes. If you can access it freely on the net, assume it is already indexed, even if there are no links to it.

redsymbol 5512 days ago

Is this true? Can you (or anyone) point to some kind of reference or evidence? If valid, I'd consider this an almost dangerous breach of privacy.

maratd 5512 days ago

As requested:

http://en.wikipedia.org/wiki/PageRank#The_intentional_surfer...

> The Google toolbar sends information to Google for every page visited, and thereby provides a basis for computing PageRank based on the intentional surfer model.

For it to display a PageRank, it has to send the URL to Google (otherwise, how is it going to know what to display the rank for?). Google can then send the crawler to that address later.

> If valid, I'd consider this an almost dangerous breach of privacy.

I don't believe they monitor who is going where. Just where people are going. Although it would be trivial for them to monitor who is going where ...

Also, an FYI, if you are logged in to Google and you're using their search engine, then they ARE monitoring you. Check out Google Web History.

redsymbol 5512 days ago

Thanks for the link.

I was concerned more with content indexing of URLs that are not meant to be public, to the point where that content could show in search results. Imagine my editor emails me a link to a blog article for approval before publishing. Or, as a designer, you create a draft of a web page to show to your client; and for the convenience of said client, you prefer not to have it password protected (nor take the time to set it up - you have enough to do!)

In both cases, imagine that someone loads the URL in their Chrome browser. If that action resulted in the URL being added to Googlebot's itinerary, even though no publicly visible webpage links to it, the result could be the exposure of information that we don't want exposed. Or for the blog post example, it could even affect SEO by causing a duplicate content penalty.

Of course we can password protect the page, exclude the URLs in robots.txt, etc. But there is a labor cost and inconvenience to having to do that, and there is always a risk that something would slip through.

That said, what I write above is likely pure speculation; I don't know of any evidence that Google is actually doing this, and it seems unlikely to me that they would.
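
The robots.txt exclusion mentioned in this comment can be sanity-checked locally. The following is a minimal sketch using Python's standard `urllib.robotparser`; the `/drafts/` path, the `example.com` URLs, and the helper name `is_crawl_allowed` are illustrative assumptions, not anything from the thread. Note that robots.txt only asks well-behaved crawlers not to fetch a URL; it does not restrict access to the content.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks all crawlers to stay out of /drafts/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URLs only: one draft page and one public page.
draft_allowed = is_crawl_allowed(ROBOTS_TXT, "http://example.com/drafts/post.html")
public_allowed = is_crawl_allowed(ROBOTS_TXT, "http://example.com/index.html")
print("draft crawlable:", draft_allowed)
print("public crawlable:", public_allowed)
```

A check like this could run before sharing an unpublished URL, which reduces the "something would slip through" risk but does not replace password protection.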

mqzaidi 5512 days ago

Directory listing was on. Searching google's cache for http://www.sosasta.com/uploaded/ will confirm as much.

angusgr 5512 days ago

Searching on cache:http://www.sosasta.com/uploaded/ doesn't show a result now from here (Australia.)

Also, searching on link:http://www.sosasta.com/uploaded/ doesn't show anyone linking to it. Even if the directory is there, it had to get there in the first place somehow.

rahoulb 5512 days ago

To: mycolleague@gmail.com

The database dump is here: http://www.secretserver.com/database.sql.gz

Don't tell anyone.

ctz 5512 days ago

You really think google adds private emails to its public index? Get real.

angusgr 5512 days ago

If the URL points to a publicly accessible web server with no robots.txt telling them to stay off it, I wouldn't be surprised if it gets fed to the crawler.

eli 5512 days ago

Google can even index URLs that are disallowed in robots.txt (it just doesn't crawl their contents).

andymurd 5512 days ago

The email wasn't indexed but (maybe) a link in the email was.

Google honours robots.txt, X-Robots-Tag headers, etc., but everything else is fair game.
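
As a rough illustration of the `X-Robots-Tag` header mentioned here, the sketch below wraps a WSGI application so that every response under a given path prefix carries `X-Robots-Tag: noindex, nofollow`. The middleware name, the demo app, and the `/uploaded/` prefix (borrowed from the directory discussed earlier in the thread) are assumptions for illustration; a real deployment would more likely set the header in the web server configuration. A crawler can only see this header if it is allowed to fetch the URL, so it does not combine well with a robots.txt disallow for the same path.

```python
from wsgiref.util import setup_testing_defaults


def noindex_middleware(app, private_prefix="/uploaded/"):
    """Wrap a WSGI app so responses under private_prefix carry X-Robots-Tag."""
    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Only tag responses for paths under the private prefix.
            if environ.get("PATH_INFO", "").startswith(private_prefix):
                headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)
        return app(environ, patched_start_response)
    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def headers_for(path):
    """Call the wrapped app in-process and return its response headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    list(noindex_middleware(demo_app)(environ, start_response))
    return captured


print(headers_for("/uploaded/dump.sql"))
print(headers_for("/index.html"))
```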

rahoulb 5510 days ago

They scan my private emails to target advertising at me - why would they not follow links (as the link obviously denotes something I'm interested in)?

And, as the others state, if it's not denied in robots.txt, then why not add it to the public index?

PonyGumbo 5512 days ago

A company I worked for had a clever script that automatically put all content on the server into a Google sitemap.
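
The script described in this comment is not shown, so the following is only a hypothetical sketch of how such a tool could behave: it walks a web root and lists every file in a sitemap, which is exactly how a stray dump would get advertised to crawlers. The function name `build_sitemap`, the `skip_extensions` parameter, and the `example.com` base URL are all assumptions for illustration.

```python
import os
import tempfile
from xml.sax.saxutils import escape


def build_sitemap(web_root, base_url, skip_extensions=()):
    """Return sitemap XML listing every file under web_root, minus skipped extensions."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(web_root):
        for name in filenames:
            if name.lower().endswith(tuple(skip_extensions)):
                continue  # e.g. keep database dumps out of the sitemap
            rel = os.path.relpath(os.path.join(dirpath, name), web_root)
            urls.append(base_url.rstrip("/") + "/" + rel.replace(os.sep, "/"))
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in sorted(urls))
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}</urlset>\n"
    )


# Demo with a throwaway directory containing a page and a forgotten dump.
with tempfile.TemporaryDirectory() as root:
    for name in ("index.html", "database.sql.gz"):
        with open(os.path.join(root, name), "w") as fh:
            fh.write("placeholder")
    naive = build_sitemap(root, "http://www.example.com")
    filtered = build_sitemap(
        root, "http://www.example.com", skip_extensions=(".sql", ".sql.gz")
    )

print(naive)
print(filtered)
```

The naive call lists the dump alongside the page; the filtered call shows one possible guard, though an allowlist of publishable paths would be a safer design than a denylist of extensions.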
