You can't compare persistent disks failing in a whole zone with a RAID array failing in a single machine. There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours unless the whole datacenter is unable to operate (flood, fire, etc.), which is not the case here. If your RAID fails, and you have a support contract that guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something went wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility. That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime. But I maintain that Google Compute Engine Persistent Disks are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point fingers at us, it's supposed to happen".
[1] https://cloud.google.com/compute/sla
All that said, people choose SSDs because they are faster and have higher throughput, so SSDs not being fast is obviously a real problem for applications that rely on that performance, and rest assured we are indeed doing whatever we can to avoid this in the future.
Disclaimer: I work in Google Cloud Support.