by whalesalad 4912 days ago
This reminds me of a recent experience I had with the Bing bot.

This most recent YC round, my co-founder and I used SkyDrive to edit our application. SkyDrive integrates pretty nicely with Word, even on a Mac, to allow for collaborative editing. It's like the best parts of SharePoint, minus all the crap, and inside of a modern UI. I'm a diehard Apple user, but I also subscribe to the "right tool for the job" principle ... in this case it worked pretty well.

Anyway, inside the document were links to some private areas of our website that contained demo materials for YC. As requested, they were not password protected, but they were also not linked from anywhere else. While submitting, I ensured that our nginx logs would capture visits to these URLs in a separate log, so we'd know when they were being looked at (side note: seeing visitors coming from inside justin.tv + the rincon hill towers is kind of exhilarating).
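
For anyone who wants something similar without a dedicated log file, here is a rough Python sketch that pulls hits on unlinked paths out of an ordinary access log. It assumes nginx's default "combined" log format; the `/yc-demo/` prefix, the IP addresses, and the sample lines are made up for illustration and are not what we actually used.

```python
import re

# Matches nginx's default "combined" access log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical prefix for the unlinked demo pages.
PRIVATE_PREFIXES = ("/yc-demo/",)


def private_hits(lines, prefixes=PRIVATE_PREFIXES):
    """Yield (ip, time, path, agent) for requests under the private prefixes."""
    for line in lines:
        m = COMBINED.match(line)
        if m and m.group("path").startswith(prefixes):
            yield m.group("ip"), m.group("time"), m.group("path"), m.group("agent")


# Made-up sample lines: one crawler hit on a private page, one ordinary visit.
sample = [
    '203.0.113.7 - - [10/Oct/2012:13:55:36 -0700] "GET /yc-demo/video HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.2 - - [10/Oct/2012:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0"',
]

hits = list(private_hits(sample))
for ip, when, path, agent in hits:
    print(ip, when, path, agent)
```

The user-agent column is what makes a crawler stand out from a human visitor, though it is self-reported and can be spoofed.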

What surprised me was that almost immediately after we began working on the document, the Bing bot was going apeshit exploring the domain and the 'private' URLs. I had to quickly add a robots.txt to deny all on the root. I thought it was pretty interesting. At first I felt almost violated. But then it seems logical that they'd be indexing every URL in every document stored in their datacenter, why not?
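
A quick way to sanity-check a deny-all robots.txt is Python's standard `urllib.robotparser`. This is only a sketch: the two-line file below is the usual deny-everything form rather than a copy of our actual file, and `example.com` and `/yc-demo/` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The usual "deny everything" robots.txt, served from the site root.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(DENY_ALL.splitlines())

# Placeholder URLs; any path on the host should be disallowed for any agent.
bing_allowed = parser.can_fetch("bingbot", "http://example.com/yc-demo/")
other_allowed = parser.can_fetch("SomeOtherBot", "http://example.com/")
print(bing_allowed, other_allowed)  # False False
```

This tells you what a crawler that follows the rules would conclude from the file, nothing more.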

3 comments

Eh, I'm pretty sure you should still feel violated. The fact that they are parsing your private documents for information that they can use to help another business unit is really sketchy. It would make me wonder what else they are scanning my data for.

Personally, I'll never use an MS cloud service because of this anecdote - not that it was that likely to begin with.

I'd feel extremely violated.

I use Google Docs, very sparingly. One of the spreadsheets there contains a URL that is not linked from anywhere else and is impossible to guess. If that URL ever gets tripped it will send me an email, and the day that happens is the day I'll stop using Google services (so far so good, and of course I should say 'Google Drive' now instead of 'Google Docs').
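
A minimal sketch of that kind of tripwire in Python, assuming the web server hands every request path to `observe`. The `Canary` class, `make_canary_path`, and the `/c/` prefix are hypothetical names for illustration; the `notify` callable stands in for whatever actually sends the email.

```python
import secrets


def make_canary_path(prefix="/c/"):
    """Return an unguessable path to plant in a private document."""
    return prefix + secrets.token_urlsafe(32)


class Canary:
    """Tracks one secret path and calls notify the first time it is requested."""

    def __init__(self, path, notify):
        self.path = path
        self.notify = notify  # hypothetical hook; would wrap an email sender
        self.tripped = False

    def observe(self, request_path, user_agent):
        """Return True if the request hit the canary path."""
        if request_path != self.path:
            return False
        if not self.tripped:
            self.tripped = True
            self.notify(f"canary hit: {request_path} from {user_agent}")
        return True


# Collect alerts in a list instead of sending mail, to keep the sketch runnable.
alerts = []
canary = Canary(make_canary_path(), alerts.append)
canary.observe("/index.html", "Mozilla/5.0")     # ordinary traffic, ignored
canary.observe(canary.path, "ExampleBot/1.0")    # trips the canary
canary.observe(canary.path, "ExampleBot/1.0")    # already tripped, no second alert
```

The path must never be opened by its owner's own browser or any link previewer, otherwise a hit proves nothing about who read the document.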

How would you know it was google indexing your document vs., say, your browser prefetching the link?

Because my browser has never looked at the document with the link in it. Obviously that would defeat the purpose.

You assume they were indexing skydrive documents. It could well be that one of the people who visited the link had a Bing toolbar installed.

Either way, all publicly accessible documents will get indexed sooner or later.

This was before the document had been sent to anyone. It was still being edited; only my friend and I were working on it. Also, the documents were not public.

I would be surprised if Microsoft is intentionally indexing links in private documents, but my point stands: Google et al. are remarkably good at indexing the web. If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

> If you don't want an otherwise public URL indexed you must use robots.txt or equivalent.

Which only blocks bots that respect the file...

You may be right, but I can't help but smirk at the thought of PG or Buchheit downloading and installing the Bing Toolbar ;)