

	by flyt 3644 days ago
	It doesn't do any good to point the finger at your vendors when your service goes down; that data isn't useful for your customers. Never forget the lesson of http://www.whoownsmyavailability.com/

2 comments

blakewatters 3643 days ago

Hey, Blake from Layer here: We were not at all trying to point the finger at Google for our issues, only to give our customers up-to-the-minute transparency on where we stood with the availability issue. We have received direct feedback from our customers that they do value detailed responses even when the news isn't great.

I take full responsibility for the issues here and the team is working to remediate the exposure as quickly as possible.


ngrilly 3644 days ago

I'm not sure I agree. Customers like to know why it doesn't work. If it were a physical machine, they would have said something like "the disks are broken and we are replacing them". But it is the cloud, so they said "Google persistent disks are currently unavailable and they are fixing it".


knorker 3644 days ago

But the real reason is "we didn't set up our system properly".

This is like saying "Hitachi Storage hard drives broke" when you actually mean "we didn't run RAID".


ngrilly 3644 days ago

You can't compare persistent disks failing in a whole zone with a RAID array failing in a single machine.

There is a reason why Amazon and Google take EBS/Persistent Disk failures very seriously: they are not supposed to be unavailable for several hours unless the whole datacenter is unable to operate (flood, fire, etc.), and that is not the case here.

If your RAID fails, and you have a support contract which guarantees restoration within 1 hour, and it's not restored within 1 hour, then I think you can legitimately say something went wrong at your provider. It's not pointing fingers. Everyone makes mistakes. It's taking responsibility.

That said, I agree they should have run in multiple zones, as recommended by Google, if they need/want to avoid that kind of downtime.

But I maintain Google Compute Engine Persistent Disks are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future, instead of saying "don't point finger at us, it's supposed to happen".


jpatokal 3644 days ago

Two clarifications: the disks were not "unavailable", they had high latency (slow I/O) in one zone only (us-central1-a); and this affected only SSD PDs, not "regular" PDs. Per the SLA [1], it's "downtime" when PDs are completely unavailable for >5 minutes in at least two zones, and neither condition was met here.

[1] https://cloud.google.com/compute/sla
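
As a rough illustration of the rule described above, here is a minimal Python sketch. The thresholds are the ones quoted in this comment, not taken from the SLA text itself, and `ZoneOutage` and `counts_as_downtime` are hypothetical names invented for the example. The sketch also ignores whether the outages overlap in time.

```python
from dataclasses import dataclass

# Thresholds as paraphrased in the comment above (assumptions, not SLA wording).
MIN_OUTAGE_MINUTES = 5
MIN_ZONES = 2


@dataclass
class ZoneOutage:
    zone: str
    fully_unavailable: bool  # True only if the disks could not be reached at all
    minutes: float           # how long the condition lasted


def counts_as_downtime(outages):
    """Return True if the outages meet the downtime rule as paraphrased above."""
    qualifying_zones = {
        o.zone
        for o in outages
        if o.fully_unavailable and o.minutes > MIN_OUTAGE_MINUTES
    }
    return len(qualifying_zones) >= MIN_ZONES


# Slow but reachable SSD PDs in a single zone, as described for this incident.
incident = [ZoneOutage("us-central1-a", fully_unavailable=False, minutes=180)]
print(counts_as_downtime(incident))
```

Under these assumptions the call returns `False` for this incident: only one zone was affected, and its disks were slow rather than unreachable.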

All that said, people choose SSD because it's faster and has higher throughput, so SSDs not being fast is obviously a real problem for applications relying on this, and rest assured we are indeed doing whatever we can to avoid this in the future.

Disclaimer: I work in Google Cloud Support.


pdeva1 3643 days ago

This is a typical Google Cloud Support response (I used to host on GCloud): stretching the definitions to somehow get out of responsibility. If the SSDs have super high latency, then for most purposes they are indeed 'unavailable'. There is a reason why the user provisioned SSDs and not a regular disk.
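
To make that point concrete, here is a minimal Python sketch of an application-side check that treats a disk as effectively unavailable once most I/O misses a latency budget. The function name, the budget, and the 50% cutoff are all assumptions chosen for illustration; none of them come from Google's SLA or tooling.

```python
def effectively_unavailable(latencies_ms, budget_ms, max_slow_fraction=0.5):
    """Treat a disk as unavailable when too many I/O samples exceed the budget.

    latencies_ms: recent I/O latencies in milliseconds (hypothetical samples).
    budget_ms: the latency the application can tolerate (an assumption).
    max_slow_fraction: share of slow samples tolerated before giving up.
    """
    if not latencies_ms:
        # No completed I/O at all: nothing to distinguish this from an outage.
        return True
    slow = sum(1 for ms in latencies_ms if ms > budget_ms)
    return slow / len(latencies_ms) > max_slow_fraction


print(effectively_unavailable([2, 3, 2, 4], budget_ms=10))
print(effectively_unavailable([500, 800, 3, 900], budget_ms=10))
```

The first call reports a healthy disk and the second an effectively unavailable one, even though every request in the second sample eventually completed.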


knorker 3644 days ago

> That said, I agree they should have run in multiple zones, as recommended by Google

If you don't follow your vendor's recommendations for how to use their product, how can you blame them when that exact recommendation would have saved you?

> Google Compute Engine Persistent Disks are not supposed to fail in such a way, and I'm quite sure Google will do whatever they can to avoid this in the future

Sure. And the power to my office is not supposed to go out (and I've certainly worked in places where there has never been an unplanned power outage in decades), but if my business relies on it I need a UPS.

> instead of saying "don't point finger at us, it's supposed to happen".

It's not, and they shouldn't. Also unless you know something I don't, they didn't.

> If your RAID fails, and you have a support contract which guarantees restoration within 1 hour,

But as another commenter pointed out: Google did not violate the SLA during this, apparently. So…


ngrilly 3644 days ago

> > instead of saying "don't point finger at us, it's supposed to happen".

> It's not, and they shouldn't. Also unless you know something I don't, they didn't.

Sorry, my comment was confusing. Google of course never said or wrote such a thing.

> Google did not violate the SLA during this, apparently.

I agree.
