by lobster_johnson 2984 days ago
Has anyone used the beta and got any feeling for how maintenance downtime impacts things? A bit nervous about how you can only set a "maintenance window" and can't plan ahead for disruption; as far as I can tell, they won't even tell you ahead of time. The HA seems really solid (zero-lag "regional disks"), but it's still a bit disconcerting.
3 comments

The updates take the entire instance down for 2-5 minutes each month. While you can't avoid them, they can be scheduled for particularly low-traffic times. If you're trying to avoid downtime, it's a giant PITA. Even with HA enabled, you still lose master, slave and read replicas. Not entirely sure what they define HA as, but a mandatory monthly downtime doesn't usually fit into my definition.

[Update] That said, from what I understand, they have a road map to maintaining read replicas and queued writes. Not sure what the date on it is though.
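To put the 2-5 minutes per month in perspective, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and counts only the maintenance downtime quoted above; the function name is just for illustration.

```python
# Assumption: a 30-day month, and maintenance is the only source of downtime.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes


def monthly_availability(downtime_minutes):
    """Return availability as a percentage for the given monthly downtime."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)


def downtime_budget(availability_percent):
    """Return the minutes of downtime a monthly availability target allows."""
    return MINUTES_PER_MONTH * (1 - availability_percent / 100.0)


if __name__ == "__main__":
    for minutes in (2, 5):
        print(f"{minutes} min/month -> {monthly_availability(minutes):.3f}% available")
    print(f"99.99% target allows {downtime_budget(99.99):.2f} min/month")
```

Under those assumptions, 2 minutes works out to about 99.995% and 5 minutes to about 99.988%, so the upper end already exceeds the roughly 4.32 minutes a 99.99% monthly target would allow.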

[I'm the Cloud SQL TL] I can't comment on timelines, but we're aware that customers are interested in more features around maintenance window scheduling, deferral, and notification, as well as shorter downtime for updates and smarter scheduling within a group of replicas.
Can you confirm that it's impossible to avoid downtime, even with HA, because of forced updates?

Surely that's what HA is? No downtime as you update each node one at a time?

If it's impossible then it's a dealbreaker.

[I'm the Cloud SQL TL] Confirmed. We know it's a problem that we need to fix. HA reduces downtime in unexpected failure cases (live migration for your primary only helps in planned shutdown cases, not if the physical machine fails), but doesn't currently help with maintenance-related downtime.
What's the point of HA if there is still maintenance downtime?
Unfortunately last time I used CloudSQL for MySQL it was incredibly unstable. They would take down our master AND standby at the same time for maintenance. When we filed a ticket they just said it was a known bug with no plans to fix.

A major client of mine migrated to AWS because of this and other issues.

I've been thinking about moving us to Google's Cloud Platform. What I found regarding maintenance here: https://cloud.google.com/compute/docs/regions-zones/#mainten... states that they do live migrations without any downtime. Can anyone elaborate? Is this only for Compute Engine? In that case, if one can run Postgres on a Compute Engine instance, why not do that instead? Surely, if one can set up a highly available Postgres cluster, Google can do updates without affecting uptime???

To be fair, we wouldn't use GCP for anything but virtual servers and storage replication... I have no desire to tie us to Google's infrastructure any more than necessary.

Were your master and standby in the same availability zone? Can't you set different maintenance windows? WTF?

https://cloud.google.com/sql/faq#maintenancerestart

According to the link above, you can taper your upgrade windows, it looks like.

"Live migration" refers to how Compute Engine transparently migrates a VM to another physical host [1]. Disk and memory are copied over, and they have some ridiculous technology that keeps network connections alive and re-attaches them to the new VM when it's been switched over, so that it causes, in principle, zero disruptions. This is much more magical than other providers, such as AWS and DigitalOcean, where such a migration results in a reboot.

You can run PostgreSQL on a VM just fine. You just have to manage it yourself. Cloud SQL comes with some upsides (zero management, spectacular HA failover capabilities) and some downsides (lack of extensions, lives on a separate network, no control over maintenance window); you have to decide what you're willing to live with.

You can set the upgrade window, but it can't be predicted. What you can control is the order — e.g. set your staging instance to "early" and production instance to "late", then hopefully staging should be upgraded first and you'll know ahead of the production upgrade if any issues arose.

[1] https://cloud.google.com/compute/docs/instances/live-migrati...

> they have some ridiculous technology that maintains network connections and re-routes them when everything switches to the new VM

Indeed, this is the primary reason I wish to switch. I have no problem maintaining our own stuff, we do that anyway. :) Thanks for the details.

If you (or the parent) are interested in some details about that ridiculous technology, there was a paper in NSDI this year: https://www.usenix.org/system/files/conference/nsdi18/nsdi18...

(disclaimer: I'm one of the many authors on the paper, although for building parts of the underlying tech, not writing the prose)

GCP has the best compute, storage, and networking of all the clouds. They are cheaper, faster, more scalable and more reliable than the others. Their managed services leave a lot to be desired (beta status, non-standard interfaces, and other limits) but if you're just looking to run VMs then that is the perfect fit for their cloud.

We consolidated everything on GKE now, which lets us use VMs but still have the Kubernetes control plane looking after things for us, which has been great so far.

Maintenance windows are set for the cluster, not single instances. We were distributed across 3 AZs and Google had no suggestions for mitigating the ~5 minutes of downtime we were seeing every week or two.

The whole experience was so amateur and unprofessional it really soured me on GCE. They do have some cool tech but it seems like their cloud division needs to mature a bit.

There is disruption, yes. It's usually short; however, we always see retries in our logs for a few minutes. Our app doesn't need perfect uptime though, and we haven't tried the HA setup.
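For anyone wondering what riding out a short window with retries can look like on the client side, here is a minimal sketch of retrying with exponential backoff and jitter. `TransientDBError` and `with_retries` are hypothetical names, not part of any driver; in a real app you would catch whatever connection error your database library raises.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def with_retries(operation, max_attempts=6, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(), retrying on TransientDBError with backoff and jitter.

    The delay cap doubles on each failed attempt (up to max_delay) and the
    actual wait is drawn uniformly from [0, cap]. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientDBError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

With the defaults above the waits add up to at most about 15 seconds, so covering a multi-minute maintenance restart would need more attempts or a larger base delay. Retrying blindly is also only safe for operations that are idempotent.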