by whalesalad 4959 days ago

This reminds me of a recent experience I had with the Bing bot.

For this most recent YC round, my co-founder and I used SkyDrive to edit our application. SkyDrive integrates pretty nicely with Word, even on a Mac, to allow for collaborative editing. It's like the best parts of SharePoint, minus all the crap, inside a modern UI. I'm a diehard Apple user, but I also subscribe to the "right tool for the job" principle ... in this case it worked pretty well.

Anyway, inside the document were links to some private areas of our website that contained demo materials for YC. As requested, they were not password protected, but also not linked from anywhere else. While submitting I made sure nginx would record visits to these URLs in a separate log, so we'd know when they were being looked at (side note: seeing visitors coming from inside justin.tv + the Rincon Hill towers is kind of exhilarating).
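A rough sketch of how visits to unlisted pages could be pulled out of an access log, assuming nginx's default "combined" log format. The `/yc-demo/` prefix, the `private_hits` helper, and the sample lines are all made up for illustration; none of them come from the setup described above.

```python
import re

# Hypothetical prefix for the unlisted demo pages (an assumption, not the real path).
PRIVATE_PREFIX = "/yc-demo/"

# Matches nginx's default "combined" access log format.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def private_hits(lines, prefix=PRIVATE_PREFIX):
    """Return (ip, time, path, agent) for each request whose path starts with prefix."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("path").startswith(prefix):
            hits.append(
                (m.group("ip"), m.group("time"), m.group("path"), m.group("agent"))
            )
    return hits


# Invented sample lines, using documentation-reserved IP addresses.
SAMPLE = [
    '203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] "GET /yc-demo/video HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.4 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0"',
]

for ip, when, path, agent in private_hits(SAMPLE):
    print(ip, when, path, agent)
```

The user agent column is what would show a crawler turning up, though a user agent string is self-reported and can be spoofed.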

What surprised me was that almost immediately after we began working on the document, the Bing bot was going apeshit exploring the domain and the 'private' URLs. I had to quickly add a robots.txt at the root to deny all. I thought it was pretty interesting. At first I felt almost violated. But then it seems logical that they'd be indexing every URL in every document stored in their datacenter. Why not?
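For reference, a deny-all robots.txt is two lines. Below is a minimal sketch that checks one with Python's standard `urllib.robotparser`; the `allowed` helper and the example.com URL are illustrative assumptions, not the actual site.

```python
from urllib.robotparser import RobotFileParser

# A deny-all robots.txt, as it would be served from the site root.
DENY_ALL = """\
User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if a crawler that honors robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL, used only for illustration.
URL = "https://example.com/yc-demo/video"

print(allowed(DENY_ALL, "bingbot", URL))  # blocked by the deny-all rules
print(allowed("", "bingbot", URL))        # an empty robots.txt blocks nothing
```

This only models a well-behaved crawler. robots.txt is a request, so it does nothing against clients that ignore it.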

rpm4321 4959 days ago

Eh, I'm pretty sure you should still feel violated. The fact that they are parsing your private documents for information that they can use to help another business unit is really sketchy. It would make me wonder what else they are scanning my data for.

Personally, I'll never use an MS cloud service because of this anecdote - not that it was that likely to begin with.

jacquesm 4959 days ago

I'd feel extremely violated.

I use Google Docs, very sparingly. One of the spreadsheets there contains a URL that is not linked from anywhere else and impossible to guess. If that URL ever gets tripped it will send me an email, and the day that happens is the day I'll stop using Google services (so far so good, and of course I should say 'Google Drive' now instead of 'Google Docs').
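A minimal sketch of this kind of tripwire, under stated assumptions: the `/c/` prefix, the example.com addresses, and both helper names are hypothetical, and this is not a description of the actual setup. It generates an unguessable path and builds the alert email when a request matches it.

```python
import secrets
from email.message import EmailMessage


def make_canary_path(prefix="/c/"):
    """Build a path with 32 random bytes in it, so it cannot be guessed."""
    return prefix + secrets.token_urlsafe(32)


def build_alert(canary_path, request_path, remote_ip, user_agent,
                to_addr="me@example.com"):
    """Return an alert email if the request hit the canary path, else None."""
    if request_path != canary_path:
        return None
    msg = EmailMessage()
    msg["Subject"] = "Canary URL tripped"
    msg["From"] = "canary@example.com"  # placeholder sender address
    msg["To"] = to_addr
    msg.set_content(
        f"Path: {request_path}\nIP: {remote_ip}\nUser-Agent: {user_agent}\n"
    )
    return msg


canary = make_canary_path()
alert = build_alert(canary, canary, "203.0.113.7", "bingbot/2.0")
print(alert["Subject"])
```

Actually delivering the message could be done with `smtplib.SMTP(...).send_message(alert)`. That step is left out here because it depends on the mail server available.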

tekromancr 4959 days ago

How would you know it was google indexing your document vs., say, your browser prefetching the link?

jacquesm 4959 days ago

Because my browser has never looked at the document with the link in it. Obviously that would defeat the purpose.

eli 4959 days ago

You assume they were indexing SkyDrive documents. It could well be that one of the people who visited the link had a Bing toolbar installed.

Either way, all publicly accessible documents will get indexed sooner or later.

whalesalad 4959 days ago

This was before the document had been sent to anyone. It was still being edited; only my friend and I were working on it. Also, the documents were not public.

eli 4959 days ago

I would be surprised if Microsoft were intentionally indexing links in private documents, but my point stands: Google et al. are remarkably good at indexing the web. If you don't want an otherwise public URL indexed, you must use robots.txt or equivalent.

drivebyacct2 4958 days ago

>If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

Which only blocks bots that respect the file...

rpm4321 4959 days ago

You may be right, but I can't help but smirk at the thought of PG or Buchheit downloading and installing the Bing Toolbar ;)
