| HN Mirror



	by RobertKohr 5167 days ago
	Can you limit bandwidth with AWS? Also, why would the spreadsheet be calling these images every hour? Did you have the spreadsheet open? Does Google make this call even when no one is viewing the spreadsheet?

2 comments

ceejayoz 5167 days ago

> Can you limit bandwidth with AWS?

Not on S3, no.


justincormack 5167 days ago

You can put a robots.txt in the bucket.


Panos 5167 days ago

Which will be ignored by Feedfetcher :-)

Plus you cannot put a robots.txt at s3.amazonaws.com, so if the file is accessed through an https://s3.amazonaws.com/.... URL, the robots.txt will not work.


simonw 5167 days ago

You could put robots.txt in the bucket if you address it using the http://mybucket.s3.amazonaws.com/ alternative URL scheme - a robots.txt in the root of the bucket would then be available at http://mybucket.s3.amazonaws.com/robots.txt
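
To illustrate why the URL scheme matters, here is a minimal Python sketch of how a well-behaved crawler would derive the robots.txt location from a resource URL: it looks at the root of the host, not at the path. The function name `robots_url`, the bucket name `mybucket`, and the object path `images/cat.jpg` are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(resource_url: str) -> str:
    """Return the robots.txt URL at the root of the resource's host."""
    parts = urlsplit(resource_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical object addressed in both URL styles.
path_style = "http://s3.amazonaws.com/mybucket/images/cat.jpg"
virtual_hosted = "http://mybucket.s3.amazonaws.com/images/cat.jpg"

print(robots_url(path_style))      # root of the shared S3 host, not yours to control
print(robots_url(virtual_hosted))  # root of the bucket, where you can upload a file
```

With the path-style URL the lookup lands on the shared host; with the bucket-as-subdomain URL it lands inside the bucket.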


Panos 5167 days ago

Yes, that would solve the issue of not being able to have your own robots.txt file, and I did not know about that. On the other hand, Feedfetcher would still ignore the robots.txt.


justincormack 5166 days ago

Google's justification for ignoring this is very weak.


icebraining 5166 days ago

I disagree. Feedfetcher is no different than a browser: it fetches the URL the user inserted, nothing more (unlike a spider, which discovers URLs by itself).


derefr 5166 days ago

I believe the parent's point was that, for the HTTPS scheme, you can't use any alternative CNAMEs, because they won't match the certificate S3 serves--so if your site is designed to be HTTPS-by-default, and is attached to an S3 bucket, putting a robots.txt in it is moot.


eli 5167 days ago

According to the article, that would not have helped; Feedfetcher is meant to be manually triggered and thus does not obey robots.txt.


tripzilch 5166 days ago

For certain definitions of "manually" :)

It's manually triggered to start downloading resources every hour regardless of whether someone needs them.

In that sense, any web spider is "manually triggered" as well ;-)


simonbrown 5167 days ago

The article states that (1) you can't and (2) the bot ignores it.


ceejayoz 5167 days ago

No, you can't. It'd have to be at the root of http://s3.amazonaws.com/.

This is mentioned specifically in the article, in fact.


JeremyBanks 5167 days ago

Pick subdomain-safe bucket names and you have an alternative.

    http://[bucket name].s3.amazonaws.com/robots.txt
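
A rough Python sketch of what "subdomain-safe" could mean in practice, assuming it means the bucket name is a single valid lowercase DNS label. The helper names `is_subdomain_safe` and `bucket_robots_url` are hypothetical, and the check is deliberately conservative: S3 also permits dots in bucket names, but this sketch rejects them because they add extra labels to the hostname.

```python
import re

# Assumption: a "subdomain-safe" name is one DNS label of 3 to 63 characters,
# using lowercase letters, digits, and hyphens, starting and ending with a
# letter or digit. Dots are rejected on purpose (see the note above).
_LABEL = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_subdomain_safe(bucket: str) -> bool:
    """Return True if the bucket name can be used as a single hostname label."""
    return _LABEL.fullmatch(bucket) is not None


def bucket_robots_url(bucket: str) -> str:
    """Build the robots.txt URL for a bucket addressed as a subdomain."""
    if not is_subdomain_safe(bucket):
        raise ValueError(f"bucket name is not subdomain-safe: {bucket!r}")
    return f"http://{bucket}.s3.amazonaws.com/robots.txt"


print(is_subdomain_safe("mybucket"))   # lowercase, single label
print(is_subdomain_safe("My_Bucket"))  # uppercase and underscore are rejected
print(bucket_robots_url("mybucket"))
```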
