Hacker News
by oozcitak 5465 days ago
> How does anyone possibly allow a data dump to come anywhere near somewhere Google could index it?

Let me just put this sql dump in the web root for a couple hours to copy over to the test server.


Google also has to know that the file is there before it can index it: from a link visible to Google, from the website's sitemap, from directory listing being enabled in Apache, or some other shit like that.
This is very simple. If you're using the Chrome browser, ChromeOS, or Google Toolbar, then Google is using their PageRank tech ... or, essentially, sending the URL you type into the browser to their servers for ranking purposes. If you can access it freely on the net, assume it is already indexed, even if there are no links to it.
Is this true? Can you (or anyone) point to some kind of reference or evidence? If valid, I'd consider this an almost dangerous breach of privacy.
As requested:

http://en.wikipedia.org/wiki/PageRank#The_intentional_surfer...

> The Google toolbar sends information to Google for every page visited, and thereby provides a basis for computing PageRank based on the intentional surfer model.

For it to display a pagerank, it has to send the url to Google (otherwise, how is it going to know what to display the rank for?). Google can then send the crawler to that address later.

> If valid, I'd consider this an almost dangerous breach of privacy.

I don't believe they monitor who is going where. Just where people are going. Although it would be trivial for them to monitor who is going where ...

Also, an FYI, if you are logged in to Google and you're using their search engine, then they ARE monitoring you. Check out Google Web History.

Thanks for the link.

I was concerned more with content indexing of URLs that are not meant to be public, to the point where that content could show in search results. Imagine my editor emails me a link to a blog article for approval before publishing. Or, as a designer, you create a draft of a web page to show to your client; and for the convenience of said client, you prefer not to have it password protected (nor take the time to set it up - you have enough to do!)

In both cases, imagine that someone loads the URL in their Chrome browser. If that action resulted in the URL being added to the googlebot's itinerary, even though no publicly visible webpage links to it, the result could be the exposure of information that we don't want. Or for the blog post example, it could even affect SEO by causing a duplicate content penalty.

Of course we can password protect the page, exclude the urls in robots.txt, etc. But there is a labor cost and inconvenience to having to do that, and there is always risk that something would slip through.
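
For the robots.txt option, here is a minimal sketch in Python of how a well-behaved crawler would read such a file, using the standard library's urllib.robotparser. The /drafts/ and /uploaded/ paths and the example.com host are made up for illustration, not taken from any real site. Keep in mind this only tells compliant crawlers not to fetch those paths; it does nothing to stop a person from downloading the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /uploaded/
"""


def is_crawl_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a crawler that obeys ROBOTS_TXT may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/drafts/post.html",
        "http://www.example.com/index.html",
    ):
        print(url, is_crawl_allowed(url))
```

A rule like this is advisory. Anything that must stay private still needs authentication or should simply not sit in the web root.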

That said, what I write above is likely pure speculation; I don't know of any evidence that Google is actually doing this, and it seems unlikely to me that they would.

Directory listing was on. Searching google's cache for http://www.sosasta.com/uploaded/ will confirm as much.
Searching on cache:http://www.sosasta.com/uploaded/ doesn't show a result now from here (Australia).

Also, searching on link:http://www.sosasta.com/uploaded/ doesn't show anyone linking to it. Even if the directory is there, it had to get there in the first place somehow.

To: mycolleague@gmail.com

The database dump is here: http://www.secretserver.com/database.sql.gz

Don't tell anyone.

You really think google adds private emails to its public index? Get real.
If the URL is a publicly accessible webserver with no robots.txt telling them to stay off it, I wouldn't be surprised if it gets fed to the crawler.
Google can even index URLs that are disallowed in robots.txt (it just doesn't crawl them).
The email wasn't indexed but (maybe) a link in the email was.

Google honours robots.txt, X-Robots-Tag headers, etc., but everything else is fair game.
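
To make the header option concrete, here is a rough sketch in Python of a throwaway file server that adds an X-Robots-Tag response header (what the line above calls X-Robots headers) to everything it serves. The class name NoIndexHandler, the make_server helper, and the port are my own assumptions for illustration; on a real Apache or nginx box you would set the header in the server config instead.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Header asking compliant search engines not to index or follow this response.
NOINDEX_HEADER = ("X-Robots-Tag", "noindex, nofollow")


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Serves files from the current directory, tagging every response."""

    def end_headers(self):
        # Add the header just before the header block is flushed to the client.
        self.send_header(*NOINDEX_HEADER)
        super().end_headers()


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    """Build (but do not start) a local server bound to 127.0.0.1."""
    return ThreadingHTTPServer(("127.0.0.1", port), NoIndexHandler)
```

Calling make_server().serve_forever() would start it. Like robots.txt, the header is only a request to crawlers that choose to honour it; the file is still downloadable by anyone who has the URL.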

They scan my private emails to target advertising at me - why would they not follow links (as the link obviously denotes something I'm interested in)?

And, as the others state, if it's not robots.txt denied then why not add it to the public index?

A company I worked for had a clever script that automatically put all content on the server into a Google sitemap.