by Yrlec 3503 days ago
If you are considering using Google Cloud Platform, it's important to know that their SLA is practically useless. It only counts requests that return HTTP status code 500. If the system is not responding at all, that is not covered by the SLA. See their definition of "Error rate": https://cloud.google.com/storage/sla

This is not just a theoretical issue. In the past week we've been making a bit more than 5 requests/sec to Google Cloud Storage, and according to NewRelic the average response time was 8 seconds! I.e. the service has been down and not responding at all for large periods of time. I've been in contact with their support team and they've refused to reimburse us anything.

4 comments

Are you by any chance using Nearline or Coldline storage? These offer slower average access time in exchange for a steep pricing discount: https://cloud.google.com/storage/docs/storage-classes

If not, drop me a line at jani at google dot com with a reference to your support case and I'll be happy to take a second look. (Yes, I work in Google Cloud Support.)

That said, we have very recently (as in, late October [1]) introduced a new pricing model for GCS with the explicit goal of reducing latency, and the SLA may be due for an update accordingly. I'll look into this.

[1] https://cloudplatform.googleblog.com/2016/10/introducing-Col...

Also, the HTTP 500 thing is specific to GCS; other services like GCE [2] define downtime more broadly as "loss of external connectivity or persistent disk access".

[2] https://cloud.google.com/compute/sla

We use Multi-regional, Regional, DRA, Nearline and Coldline. NewRelic doesn't differentiate between the buckets. It only provides an average across all requests to storage.googleapis.com. However, since you are now promising sub-second access time for all storage classes, the storage class still wouldn't explain it.

Don't you agree that it's odd to only include HTTP 500 errors in the error rate? Let's say someone hacks your DNS servers and points storage.googleapis.com to 127.0.0.1. Then the entire service would be down completely, but according to your SLA you'd have 100% uptime.
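To make the DNS example concrete, here is a minimal sketch of the two ways of counting. It assumes a simplified reading of the SLA as described in this thread: only HTTP 500 responses count as errors, and the denominator is every attempted request. The `Attempt` type and both function names are made up for illustration and are not part of any Google API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Attempt:
    # HTTP status returned by the service, or None when no response arrived
    # at all (DNS failure, connection refused, timeout). Hypothetical type.
    status: Optional[int]


def sla_error_rate(attempts: Sequence[Attempt]) -> float:
    """Error rate under the assumed SLA reading: only HTTP 500 counts."""
    if not attempts:
        return 0.0
    errors = sum(1 for a in attempts if a.status == 500)
    return errors / len(attempts)


def observed_failure_rate(attempts: Sequence[Attempt]) -> float:
    """Failure rate as a client sees it: no response or any 5xx is a failure."""
    if not attempts:
        return 0.0
    failures = sum(1 for a in attempts if a.status is None or a.status >= 500)
    return failures / len(attempts)


# The DNS scenario: every request fails before it ever reaches the service.
hijacked = [Attempt(status=None) for _ in range(100)]
print(sla_error_rate(hijacked))         # 0.0
print(observed_failure_rate(hijacked))  # 1.0
```

Under these assumptions the hijacked hostname produces an SLA error rate of zero while every single client request fails.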

I asked Google's support team the same thing about the SLA not covering situations where the system does not respond to any requests at all. This was their response: "please understand that these SLA's are meant to cover backend issues on our end. In your scenario, we would have no control over our DNS server getting hacked. I apologize if there was confusion caused."

So Google claims that it does not have control over its own DNS servers and is therefore not to blame if the DNS is pointing to the wrong IP. Not very reassuring.

If your product's users have unreliable connections, a GCS connection timeout might be a failure of their connections rather than GCS itself.

If a mobile app can't connect to GCS it could be that GCS is down - but more likely the user just has a weak signal.

These numbers are from AWS EC2 instances, not from mobile app users.

It sounds like your problem is non-trivial, and is currently being diagnosed by support. Hopefully they get to the bottom of it and find the root cause very soon.

Unfortunately these things can occur in the darndest of places: a bug in Google Cloud, an incident at GCS, or maybe even something in your monitoring stack. I would encourage you to hold off on judgement until the root cause is identified.

One assurance I can make is that Google SRE monitors these things very carefully 24/7, and such levels of latency in the service would be treated as an incident. So it's likely something else is going on.

(I work at Google Cloud, but not on GCS or support.)

Thanks for the follow-up. This has been an open support case since October 15th. Regardless of the cause of the issue, I find it absurd that only HTTP 500 errors are covered by the SLA.

IMO: For quite a few cloud services, the SLA is only useful as a vague indicator of what the system was designed for. The reason I'm saying this is that quite often you just get a small amount of service credits if the SLA is not met.

For many SaaS businesses the service credits are quite useless, because you provide so much value on top of the cloud services you purchase. You pay $1 for cloud and charge your customer $50 for your app. If the cloud is down, you get $0.10 in credits and need to credit your own customer $5 (in the best case).
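The arithmetic behind that example, as a rough sketch. The 10% credit rates are an assumption read off the numbers above ($0.10 on $1 and $5 on $50), not terms from any real SLA, and the function name is made up. Amounts are in integer cents to avoid floating-point rounding.

```python
def outage_net_loss_cents(cloud_bill_cents: int,
                          customer_bill_cents: int,
                          provider_credit_pct: int,
                          own_credit_pct: int) -> int:
    """Net cost of an outage to the SaaS vendor, in cents.

    Credits you owe your own customer minus credits you receive from the
    cloud provider. All percentages are illustrative assumptions.
    """
    credit_received = cloud_bill_cents * provider_credit_pct // 100
    credit_paid = customer_bill_cents * own_credit_pct // 100
    return credit_paid - credit_received


# $1 cloud bill, $50 customer bill, 10% credit on both sides.
loss = outage_net_loss_cents(100, 5000, 10, 10)
print(f"net loss: ${loss / 100:.2f}")  # net loss: $4.90
```

With these numbers the provider's credit covers about 2% of what the vendor pays out, which is why the credit itself does little to offset the outage.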

(I'm not blaming the cloud providers for this. If they offered better terms, they would have to either pass the risk on to their customers by raising prices significantly, or take the risk of going bankrupt in case of major problems.)

I'd be interested to hear more details; this doesn't square with my personal experience, unless it was one of the delayed-availability classes, as someone else mentioned.

Here are the GCS response times according to NewRelic: http://imgur.com/l1dj1Mx
We are also calling S3 from our servers. S3 is receiving more requests and has had 0 issues. This is the corresponding NewRelic data for S3: http://imgur.com/HjH4f0Q

Is that 8 seconds to first byte, or 8 seconds for the complete body?

No SLA I've seen guarantees a time for the full body, because that time fluctuates too much with both the size of the object and the current state of the internet. The new-ish refresh of the GCS lineup of services says you get sub-second access, but that has to be time to first byte, and I have a hunch that NewRelic shows you time to last byte.

If my assumptions are accurate, I would say the data you get from NewRelic does not warrant reimbursement from Google, though I might side with you if all of your objects are tiny.
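For anyone who wants to check which of the two numbers they are looking at, here is a minimal sketch that records both for a single GET using only the Python standard library. `measure_latency` is a made-up helper, not how NewRelic instruments anything. Because `urlopen` returns once the response headers have arrived, "first byte" here means the first byte of the body.

```python
import time
import urllib.request


def measure_latency(url, opener=urllib.request.urlopen,
                    chunk_size=64 * 1024, timeout=30.0):
    """Return (ttfb_seconds, ttlb_seconds, body_bytes) for one GET.

    `opener` is injectable so the function can be exercised without a
    network; by default it is urllib.request.urlopen.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as resp:
        first = resp.read(1)  # blocks until the first body byte (or EOF)
        ttfb = time.perf_counter() - start
        size = len(first)
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            size += len(chunk)
        ttlb = time.perf_counter() - start
    return ttfb, ttlb, size
```

Calling it on a small object and a large object in the same bucket should show whether a high average is dominated by transfer time (the two numbers diverge as objects grow) or by the service being slow to respond at all (time to first byte is already high).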

I used to work at New Relic and I think that is time to last byte. However, it would be "time to send last byte to his app" not "time to send last byte to his end users" from that view. The missing piece of the equation from his original post is the average size of the payload which would enable us to do more than speculate...

This also occurs for small payloads (e.g. list operations). Google's support team has acknowledged that the problem was on their end. They sent us this message: "I wanted to let you know we have some more information regarding the root cause of the issue you faced. Further investigation with our engineering team confirmed that the issue was caused by a provisioning error in the internal Cloud Storage infrastructure that led to low performance and “Service Unavailable” errors when handling uploads to the US region."

Unfortunately the problem is still occurring after I got this message (although less frequently).