Hacker News
by dozzman 1429 days ago
My main confusion with this downtime is that neither their Cloud SQL nor their Redis offering managed to complete a failover, despite my org having high availability enabled on both of those plans. Is there something I'm missing here? I would have expected failover to kick in for high availability instances and keep downtime minimal. However, it's been almost 24 hours and our Cloud SQL instance is still stuck attempting to fail over, not to mention that HA comes at a premium. Wondering if anyone can help me understand what I'm missing, or whether the failover behaviour is simply not working. We've made our own workarounds in the meantime.

Relevant docs I've checked for behaviour:

https://cloud.google.com/memorystore/docs/redis/high-availab...

https://cloud.google.com/sql/docs/mysql/high-availability

EDIT: Have found out from our ops team that the SQL instance recovered around 3am so it was down for approximately 9 hours -- which is still totally useless for something deemed HA.

4 comments

Seems like they need to start issuing some refunds/credits.
IIUC, the HA setting only fails over across *zones in the same region*. If the whole region is down, HA won’t help. In this case, the London data center is the region.
The region wasn't down though. Only one zone was down?

From earlier in the incident history:

> Cloud SQL:

> Impact/Diagnosis: Non-HA instances backed by europe-west2-a are hard-down in europe-west2-a. HA instances that were in europe-west2-a when the incident started, are down with stuck failovers.

That’s expected: Cloud SQL is not multi-region. Cloud providers define HA as being multi-zonal, which your setup was.

Try Spanner if one region is not enough.

The whole region wasn’t down though, only zone europe-west2-a, so AFAICT high availability should’ve covered this particular outage.
That’s pretty terrible!