by justincormack 5167 days ago
You can put a robots.txt in the bucket.

Which will be ignored by Feedfetcher :-)

Plus, you cannot put a robots.txt at s3.amazonaws.com, so if the file is accessed through the https://s3.amazonaws.com/.... URL, the robots.txt will not work.

You could put robots.txt in the bucket if you address it using the http://mybucket.s3.amazonaws.com/ alternative URL scheme - a robots.txt in the root of the bucket would then be available at http://mybucket.s3.amazonaws.com/robots.txt.
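To make the difference between the two URL schemes concrete, here is a minimal sketch of where a robots.txt-respecting client would look for a given file URL. It assumes the usual convention that robots.txt lives at the root of the scheme and host being fetched; `robots_url`, the bucket name `mybucket` and the file name `big-file.zip` are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(resource_url: str) -> str:
    """Return the robots.txt URL at the root of resource_url's scheme and host."""
    parts = urlsplit(resource_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical example URLs: the same object addressed both ways.
path_style = "https://s3.amazonaws.com/mybucket/big-file.zip"
virtual_hosted = "http://mybucket.s3.amazonaws.com/big-file.zip"

print(robots_url(path_style))      # root of s3.amazonaws.com, outside any bucket
print(robots_url(virtual_hosted))  # root of the bucket itself
```

With the path-style URL the lookup lands on a host root that no bucket owner controls; with the bucket-as-subdomain URL it lands on a key inside the bucket.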
Yes, that would solve the issue of not being able to have your own robots.txt file, and I did not know about that. On the other hand, Feedfetcher would still ignore the robots.txt.
Google's justification for ignoring robots.txt is very weak.
I disagree. Feedfetcher is no different than a browser: it fetches the URL the user inserted, nothing more (unlike a spider, which discovers URLs by itself).
Not true. It fetches the URL every single hour, not just when the user requests it. So Google is claiming they can ignore robots.txt because it was an action performed by a user (true) but they're unleashing a huge problem with this background refreshing. Google is wasting gobs of their own money, too. What if I made a bot that generated 1000s of Google accounts with 1000s of spreadsheets hotlinking 1000s of big files stored on S3? This one guy's one file did TERABYTES of transfers over a week. The underlying problem is that Google is relying on the domain name to indicate the company size, and thus the bandwidth allocation for this service.
I believe the parent's point was that, for the HTTPS scheme, you can't use any alternative CNAMEs, because they won't match the certificate S3 serves--so if your site is designed to be HTTPS-by-default, and is attached to an S3 bucket, putting a robots.txt in it is moot.
According to the article, that would not have helped; Feedfetcher is meant to be manually triggered and thus does not obey robots.txt.
For certain definitions of "manually" :)

It's manually triggered to start downloading resources every hour regardless of whether someone needs them.

In that sense, any web spider is "manually triggered" as well ;-)

The article states that (1) you can't and (2) the bot ignores it.
No, you can't. It'd have to be at the root of http://s3.amazonaws.com/.

This is mentioned specifically in the article, in fact.

Pick subdomain-safe bucket names and you have an alternative.

    http://[bucket name].s3.amazonaws.com/robots.txt
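As a rough sketch of what "subdomain-safe" could mean in practice, the check below treats a bucket name as safe when it is a single lowercase DNS label. That rule is an assumption chosen to be conservative (3 to 63 characters, lowercase letters, digits and hyphens, no dots, no leading or trailing hyphen), not a statement of S3's full naming rules; `is_subdomain_safe` and `bucket_robots_url` are hypothetical helper names.

```python
import re

# Assumed conservative rule: one DNS label, 3-63 chars, lowercase letters,
# digits and hyphens, starting and ending with a letter or digit.
_LABEL = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_subdomain_safe(bucket: str) -> bool:
    """True if the bucket name can be used as a single host label."""
    return _LABEL.fullmatch(bucket) is not None


def bucket_robots_url(bucket: str) -> str:
    """Build the robots.txt URL for a bucket addressed as a subdomain."""
    if not is_subdomain_safe(bucket):
        raise ValueError(f"bucket name is not subdomain-safe: {bucket!r}")
    return f"http://{bucket}.s3.amazonaws.com/robots.txt"


print(bucket_robots_url("mybucket"))
print(is_subdomain_safe("My_Bucket"))
```

Dots are rejected here on purpose: a dotted name would add extra host labels, which is the kind of name that tends to cause trouble once HTTPS is involved.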