by jpatokal 3503 days ago
Are you by any chance using Nearline or Coldline storage? These offer slower average access times in exchange for a steep pricing discount: https://cloud.google.com/storage/docs/storage-classes

If not, drop me a line at jani at google dot com with a reference to your support case and I'll be happy to take a second look. (Yes, I work in Google Cloud Support.)

That said, we have very recently (as in, late October [1]) introduced a new pricing model for GCS with the explicit goal of reducing latency, and the SLA may be due for an update accordingly. I'll look into this.

[1] https://cloudplatform.googleblog.com/2016/10/introducing-Col...

Also, the HTTP 500 thing is specific to GCS; other services like GCE [2] define downtime more broadly as "loss of external connectivity or persistent disk access".

[2] https://cloud.google.com/compute/sla


We use Multi-regional, Regional, DRA, Nearline and Coldline. NewRelic doesn't differentiate between the buckets. It only provides an average across all requests to storage.googleapis.com. However, since you are now promising sub-second access time for all storage classes, it still wouldn't explain it.

Don't you agree that it's odd to only include HTTP 500 errors in the error rate? Let's say someone hacks your DNS servers and points storage.googleapis.com to 127.0.0.1. Then the entire service would be completely down, but according to your SLA you'd have 100% uptime.
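To make the scenario concrete, here is a minimal sketch. It assumes the SLA error rate is computed as HTTP 500 responses divided by total requests, which is a simplification for illustration and not the SLA's actual wording. The `Request` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    # HTTP status code, or None when no response arrived at all
    # (DNS failure, connection refused, timeout).
    status: Optional[int]


def sla_error_rate(requests: Sequence[Request]) -> float:
    """Error rate that counts only HTTP 500 responses (assumed SLA-style metric)."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status == 500)
    return errors / len(requests)


def observed_failure_rate(requests: Sequence[Request]) -> float:
    """Failure rate as a client sees it: anything that is not a 2xx response."""
    if not requests:
        return 0.0
    failures = sum(
        1 for r in requests if r.status is None or not 200 <= r.status < 300
    )
    return failures / len(requests)


# Hypothetical outage: DNS points at 127.0.0.1, so no request gets a response.
outage = [Request(status=None) for _ in range(1000)]
print(sla_error_rate(outage))         # 0.0
print(observed_failure_rate(outage))  # 1.0
```

Under these assumptions the 500-only metric stays at zero while every single request fails from the client's point of view.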

I asked Google's support team the same thing regarding the SLA not covering situations where the system does not respond to any requests at all. This was their response: "please understand that these SLA's are meant to cover backend issues on our end. In your scenario, we would have no control over our DNS server getting hacked. I apologize if there was confusion caused."

So Google claims that it does not have control over its own DNS servers and is therefore not to blame if the DNS is pointing to the wrong IP. Not very reassuring.

If your product's users have unreliable connections, a GCS connection timeout might be a failure of their connections rather than GCS itself.

If a mobile app can't connect to GCS it could be that GCS is down - but more likely the user just has a weak signal.

These numbers are from AWS EC2 instances, not from mobile app users.

It sounds like your problem is non-trivial, and is currently being diagnosed by support. Hopefully they get to the bottom of it and find the root cause very soon.

Unfortunately, these things can occur in the darndest of places: a bug in Google Cloud, an incident at GCS, or maybe even your monitoring stack. I would encourage you to hold off judgement until the root cause is identified.

One assurance I can make is that Google SRE monitors these things very carefully 24/7, and such levels of latency in the service would be treated as an incident. So it's likely something else is going on.

(I work at Google Cloud, but not on GCS or support.)

Thanks for the follow-up. This has been an open support case since October 15th. Regardless of the cause of the issue, I find it absurd that only HTTP 500 errors are covered by the SLA.