by ngrilly 3644 days ago
You can't compare persistent disks failing in a whole zone with a RAID array failing in a single machine.

There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours unless the whole datacenter is unable to operate (flood, fire, etc.), which is not the case here.

If your RAID fails, and you have a support contract which guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something was wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility.

That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime.

But I maintain Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point finger at us, it's supposed to happen".

2 comments

Two clarifications: the disks were not "unavailable", they had high latency (slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs, not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely unavailable for >5 minutes in at least two zones, and neither condition was met here.

[1] https://cloud.google.com/compute/sla

All that said, people choose SSD because it's faster and has higher throughput, so SSDs not being fast is obviously a real problem for applications relying on this, and rest assured we are indeed doing whatever we can to avoid this in the future.

Disclaimer: I work in Google Cloud Support.
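
To make the downtime definition quoted above concrete, here is a minimal sketch of the rule as paraphrased in this comment (completely unavailable for more than 5 minutes in at least two zones). It is not the SLA text itself: `ZoneIncident`, `counts_as_downtime`, and the 180-minute duration are hypothetical names and values chosen for illustration, and the sketch does not model whether the zone outages overlap in time.

```python
from dataclasses import dataclass


@dataclass
class ZoneIncident:
    zone: str
    fully_unavailable: bool  # False for slow-but-reachable disks (high latency)
    minutes: float           # how long the condition lasted


def counts_as_downtime(incidents, min_minutes=5, min_zones=2):
    """Apply the rule as paraphrased in the comment, not the official SLA wording."""
    # Only zones where disks were completely unavailable for longer than
    # the threshold count toward the zone total.
    affected_zones = {
        i.zone
        for i in incidents
        if i.fully_unavailable and i.minutes > min_minutes
    }
    return len(affected_zones) >= min_zones


# The incident as described here: high latency in a single zone.
# The duration is an illustrative assumption.
incident = [ZoneIncident("us-central1-a", fully_unavailable=False, minutes=180)]
print(counts_as_downtime(incident))  # False
```

Under this reading, slow SSD PDs in one zone fail both conditions, which is exactly the point the reply below disputes: a disk can be "available" by this test and still unusable in practice.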

This is a typical Google Cloud Support response (I used to host on GCloud): stretching the definitions to somehow get out of responsibility. If the SSDs have super high latency, then for most purposes they are indeed 'unavailable'. There is a reason why the user provisioned SSDs and not a regular disk.

> That said, I agree they should have run in multiple zones, as recommended by Google

If you don't follow your vendor's recommendations for how to use their product, how can you blame them when that exact recommendation would have saved you?

> Google Compute Engine Persistent Disk are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future

Sure. And the power to my office is not supposed to go out (and I've certainly worked in places where there has never been an unplanned power outage in decades), but if my business relies on it I need a UPS.

> instead of saying "don't point finger at us, it's supposed to happen".

It's not, and they shouldn't. Also unless you know something I don't, they didn't.

> If your RAID fails, and you have a support contract which guarantees restoration within 1 hour,

But as another commenter pointed out: Google did not violate the SLA during this, apparently. So…

> > instead of saying "don't point finger at us, it's supposed to happen".

> It's not, and they shouldn't. Also unless you know something I don't, they didn't.

Sorry, my comment was confusing. Google of course never said or wrote such a thing.

> Google did not violate the SLA during this, apparently.

I agree.