It doesn't do any good to point the finger at your vendors when your service goes down; that data isn't useful for your customers. Never forget the lesson of http://www.whoownsmyavailability.com/
Hey, Blake from Layer here: We were not at all trying to point the finger at Google, only to give our customers up-to-the-minute transparency on where we stood with the availability issue. We have received direct feedback from our customers that they do value detailed responses even when the news isn't great.
I take full responsibility for the issues here and the team is working to remediate the exposure as quickly as possible.
I'm not sure I agree. Customers like to know why it doesn't work. If it were a physical machine, they would have said something like "the disks are broken and we are replacing them". But this is the cloud, so they said "Google persistent disks are currently unavailable and they are fixing it".
You can't compare persistent disks failing in a whole zone with a RAID array failing in a single machine.
There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours unless the whole datacenter is unable to operate (flood, fire, etc.), which is not the case here.
If your RAID fails, and you have a support contract which guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something was wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility.
That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime.
But I maintain Google Compute Engine Persistent Disks are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point the finger at us, it's supposed to happen".
Two clarifications: the disks were not "unavailable", they had high latency (slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs, not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely unavailable for >5 minutes in at least two zones, and neither condition was met here.
All that said, people choose SSD because it's faster and has higher throughput, so SSDs not being fast is obviously a real problem for applications relying on this, and rest assured we are indeed doing whatever we can to avoid this in the future.
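For anyone who wants the SLA condition spelled out, here is a minimal sketch of the rule as paraphrased in this comment (completely unavailable for more than 5 minutes in at least two zones). The function name, the input shape, and the constants are my own assumptions for illustration; this is not the SLA text and not any Google API.

```python
from datetime import timedelta

# Thresholds as paraphrased in the comment above, not copied from the SLA itself.
MIN_OUTAGE = timedelta(minutes=5)
MIN_ZONES = 2


def counts_as_downtime(unavailable_by_zone):
    """Return True if enough zones were completely unavailable for long enough.

    unavailable_by_zone (hypothetical input shape) maps a zone name to how
    long PDs were *completely* unavailable there. Slow I/O counts as zero
    under this reading of the rule.
    """
    affected = [
        zone
        for zone, duration in unavailable_by_zone.items()
        if duration > MIN_OUTAGE
    ]
    return len(affected) >= MIN_ZONES


# The incident as described: high latency in one zone, no complete unavailability.
incident = {"us-central1-a": timedelta(0)}
print(counts_as_downtime(incident))  # False
```

Under this reading, hours of slow I/O in a single zone never trips the check, which is the gap between "downtime" in the contractual sense and what the application actually experiences.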
This is a typical Google Cloud Support response (I used to host on GCloud): stretching the definitions to somehow get out of responsibility. If the SSDs have super high latency, then for most purposes they are indeed 'unavailable'. There is a reason why the user provisioned SSDs and not regular disks.
> That said, I agree they should have run in multiple zones, as recommended by Google
If you don't follow your vendor's recommendations for how to use their product, how can you blame them when that exact recommendation would have saved you?
> Google Compute Engine Persistent Disks are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future
Sure. And the power to my office is not supposed to go out (and I've certainly worked in places where there has never been an unplanned power outage in decades), but if my business relies on it I need a UPS.
> instead of saying "don't point the finger at us, it's supposed to happen".
It's not, and they shouldn't. Also, unless you know something I don't, they didn't.
> If your RAID fails, and you have a support contract which guarantees restoration within 1 hour,
But as another commenter pointed out, Google apparently did not violate the SLA during this. So…