

	by Yrlec 3550 days ago
	If you are considering Google Cloud Platform, it's important to know that their SLA is practically useless. It only covers requests that return HTTP status code 500. If the system is not responding at all, that is not covered by the SLA. See their definition of "Error rate": https://cloud.google.com/storage/sla This is not just a theoretical issue. In the past week we've been making a bit more than 5 requests/sec to Google Cloud Storage, and according to NewRelic the average response time was 8 seconds! I.e., the service has been down and not responding at all for long periods of time. I've been in contact with their support team and they've refused to reimburse us anything.

4 comments

jpatokal 3550 days ago

Are you by any chance using Nearline or Coldline storage? These offer slower average access times in exchange for a steep pricing discount: https://cloud.google.com/storage/docs/storage-classes

If not, drop me a line at jani at google dot com with a reference to your support case and I'll be happy to take a second look. (Yes, I work in Google Cloud Support.)

That said, we have very recently (as in, late October [1]) introduced a new pricing model for GCS with the explicit goal of reducing latency, and the SLA may be due for an update accordingly. I'll look into this.

[1] https://cloudplatform.googleblog.com/2016/10/introducing-Col...

Also, the HTTP 500 thing is specific to GCS; other services like GCE [2] define downtime more broadly as "loss of external connectivity or persistent disk access".

[2] https://cloud.google.com/compute/sla


Yrlec 3550 days ago

We use Multi-regional, Regional, DRA, Nearline and Coldline. NewRelic doesn't differentiate between the buckets. It only provides an average across all requests to storage.googleapis.com. However, since you are now promising sub-second access time for all storage classes, that still wouldn't explain it.

Don't you agree that it's odd to only include HTTP 500 errors in the error rate? Let's say someone hacks your DNS servers and points storage.googleapis.com to 127.0.0.1. Then the entire service would be down completely but according to your SLA you'd have 100% up time.
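
The DNS scenario above can be sketched in a few lines. This is a minimal illustration, assuming the SLA's error rate counts only HTTP 500 responses as described in the post; the `Request` type, both function names, and the choice to keep unanswered requests in the denominator are hypothetical and not taken from Google's SLA text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    # HTTP status code, or None if no response ever arrived
    # (timeout, connection refused, DNS pointing at the wrong host).
    status: Optional[int]


def sla_error_rate(requests: List[Request]) -> float:
    """Error rate that counts only HTTP 500 responses (assumed SLA reading)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status == 500)
    return failed / len(requests)


def observed_error_rate(requests: List[Request]) -> float:
    """Error rate as a client experiences it: any 5xx or no response at all."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status is None or r.status >= 500)
    return failed / len(requests)


# Hypothetical total outage: every request fails without an HTTP response.
outage = [Request(status=None) for _ in range(100)]

print(sla_error_rate(outage))       # 0.0
print(observed_error_rate(outage))  # 1.0
```

Under these assumptions a complete outage yields a 0% error rate by the HTTP-500-only definition while the client sees 100% failures.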


Yrlec 3549 days ago

I asked Google's support team the same thing regarding the SLA not including situations where the system does not respond to any requests at all. This was their response: "please understand that these SLA's are meant to cover backend issues on our end. In your scenario, we would have no control over our DNS server getting hacked. I apologize if there was confusion caused."

So Google claims that it does not have control over its own DNS servers and is therefore not to blame if the DNS is pointing to the wrong IP. Not very reassuring.


michaelt 3550 days ago

If your product's users have unreliable connections, a GCS connection timeout might be a failure of their connections rather than GCS itself.

If a mobile app can't connect to GCS it could be that GCS is down - but more likely the user just has a weak signal.


Yrlec 3550 days ago

These numbers are from AWS EC2 instances. Not from mobile app users.


vgt 3550 days ago

It sounds like your problem is non-trivial, and is currently being diagnosed by support. Hopefully they get to the bottom of it and find the root cause very soon.

Unfortunately these things can occur in the darndest of places, such as a bug in Google Cloud, an incident at GCS, or maybe even in your monitoring stack. I would encourage you to hold off judgement until the root cause is identified.

One assurance I can make is that Google SRE monitors these things very carefully 24/7, and such levels of latency in the service would be treated as an incident. So it's likely something else is going on.

(work at Google Cloud, but not on GCS or support)


Yrlec 3550 days ago

Thanks for the follow-up. This has been an open support case since October 15th. Regardless of the cause of the issue, I find it absurd that only HTTP 500 errors are covered by the SLA.


jpalomaki 3550 days ago

IMO: In quite a few cloud services the SLA is only useful as a vague indicator of what the system was designed for. The reason I'm saying this is that quite often you just get a small amount of service credits if the SLA is not met.

For many SaaS businesses the service credits are quite useless, because you provide so much value on top of the cloud services you purchase. You pay $1 for cloud and charge your customer $50 for your app. If cloud is down, you get $0.10 as credits and need to credit your own customer $5 (in the good case).

(I'm not blaming the cloud providers for this. If they offered better terms, they would have to either pass the risk on to their customers by raising prices significantly, or accept the risk of going bankrupt in case of major problems.)
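
The $1 / $50 example above works out as follows. This is a small sketch using only the figures from the comment; the function name `net_loss` and the reading of $0.10 and $5 as 10% of each bill are assumptions made for illustration.

```python
def net_loss(cloud_cost: float, credit_rate: float,
             customer_price: float, refund_rate: float) -> float:
    """Amount the SaaS business is out of pocket after an SLA breach."""
    credit_received = cloud_cost * credit_rate   # what the cloud provider credits you
    refund_owed = customer_price * refund_rate   # what you credit your own customer
    return refund_owed - credit_received


# Figures from the comment: $1 cloud bill, $50 app price, 10% credited on each side.
loss = net_loss(cloud_cost=1.00, credit_rate=0.10,
                customer_price=50.00, refund_rate=0.10)
print(f"${loss:.2f}")  # $4.90
```

The provider's credit covers about 2% of what the business has to hand back to its own customer.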


mikecb 3550 days ago

I'd be interested to hear more details. This doesn't square with my personal experience, unless it was one of the delayed-availability classes like someone else mentioned.


Yrlec 3550 days ago

Here are the GCS response times according to NewRelic: http://imgur.com/l1dj1Mx


Yrlec 3549 days ago

We are also calling S3 from our servers. S3 is receiving more requests and has had 0 issues. This is the corresponding NewRelic data for S3: http://imgur.com/HjH4f0Q


Ironlink 3550 days ago

Is that 8 seconds to first byte, or 8 seconds for the complete body?

No SLA I've seen guarantees a time for full body because that time fluctuates too much with both the size of the object and the current state of the internet. The new-ish refresh of the GCS lineup of services says you get sub-second access, but that has to be time to first byte, and I have a hunch that NewRelic shows you time to last byte.

If my assumptions are accurate, I would say the data you get from NewRelic does not warrant reimbursement from Google, though I might side with you if all of your objects are tiny.
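
To tell the two numbers apart, one could time them separately on the client. The following is a rough sketch, not how NewRelic measures anything: `time_read` is a hypothetical helper, and "first byte" here means the first byte of the response body, so it also includes the time taken to receive the headers.

```python
import time
from typing import Callable, Tuple


def time_read(open_response: Callable, chunk_size: int = 65536) -> Tuple[float, float, int]:
    """Return (seconds to first body byte, seconds to last byte, bytes read).

    open_response is any zero-argument callable that returns a file-like
    object usable as a context manager, for example an HTTP response.
    """
    start = time.perf_counter()
    with open_response() as resp:
        first = resp.read(1)                      # blocks until the first body byte arrives
        ttfb = time.perf_counter() - start
        size = len(first)
        while chunk := resp.read(chunk_size):     # drain the rest of the body
            size += len(chunk)
        ttlb = time.perf_counter() - start
    return ttfb, ttlb, size
```

With the standard library this could be called as `time_read(lambda: urllib.request.urlopen(url, timeout=30))` after `import urllib.request`, where `url` points at the object being tested. If time to first byte is small and time to last byte is large, the delay is in the transfer rather than in the service's initial response.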


_fool 3550 days ago

I used to work at New Relic and I think that is time to last byte. However, it would be "time to send last byte to his app" not "time to send last byte to his end users" from that view. The missing piece of the equation from his original post is the average size of the payload which would enable us to do more than speculate...


Yrlec 3550 days ago

This also occurs for small payloads (e.g. list operations). Google's support team has acknowledged that the problem was on their end. They sent us this message: "I wanted to let you know we have some more information regarding the root cause of the issue you faced. Further investigation with our engineering team confirmed that the issue was caused by a provisioning error in the internal Cloud Storage infrastructure that led to low performance and “Service Unavailable” errors when handling uploads to the US region."

Unfortunately the problem is still occurring after I got this message (although less frequently).
