by natbennett 926 days ago
The approach here is interesting and well-documented! However, this line gives me pause—

> Modern customers expect 100% availability.

This is not my preference as a customer, nor has it been my experience as a vendor. For many workloads consistency is much more important than availability. I’m often relieved when I see a vendor announce a downtime window because it suggests they’re being sensible with my data.


OP Here - that's great feedback! Our hope is to build confidence in both the reliability of our product _and_ the consistency of the workloads. Of course, presenting the illusion of consistency while being flaky is far worse than managing customer expectations and taking intentional downtime to, in the long run, have better uptime.

Indeed, having periodic maintenance windows expected up-front probably leads to more robust architectures overall: customers building in the failsafes they need to tolerate downtime leads to more resilience. Teams that can trust their customers in that way can, in turn, take the time they need to make the investments they need to build a better product.

Perhaps this will be the blog post we write after our next major version upgrade: expectation setting around downtime _is_ the way to very high uptime.

Google famously turned off a critical internal service for a minute or so, because they had promised 99.999% (or something like that) uptime, but the service hadn't actually gone down in a few years.

In order to make sure that (internal) consumers of that service could handle the downtime, they introduced some artificially.
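For a sense of scale, here is a minimal sketch of the downtime budget that an availability target implies. The function name and the 365-day year are my own assumptions for illustration, not anything from Google or the article. At 99.999% the budget comes out to roughly five minutes a year, so a service that never goes down is leaving its whole budget unspent.

```python
def downtime_budget_seconds(availability: float, period_days: float = 365.0) -> float:
    """Seconds of downtime permitted over the period at a given availability (0..1).

    Hypothetical helper for illustration; assumes a 365-day year by default.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    seconds_in_period = period_days * 24 * 60 * 60
    return (1.0 - availability) * seconds_in_period


if __name__ == "__main__":
    # Print the yearly budget for two to five nines.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target:.3%} -> {minutes:,.1f} minutes per year")
```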

Never saw this communicated by Google, but Netflix is the company I have in mind for doing that: https://github.com/Netflix/chaosmonkey
I think it was Chubby they did that with.
Thanks for giving the source!

In my defense, by being so vague, I can't accidentally reveal company secrets. (I used to work as an SRE for Google Photos for a while.)

Yeah I’d be a lot more confident about this if you talked some about consistency vs. availability and the details of your workload that made you want to choose this trade off.

I have potentially a weird experience path here: I worked with Galera a bunch early on because when we asked customers if they wanted HA they said “yes absolutely”, so we sank a ton of time into absolutely never ever going down.

When we finally presented the trade-off space (in short, that occasional 10-minute downtime windows could basically guarantee that we wouldn’t have data loss) we ended up building a very different product.

Depends who the customer is. I'm a customer of AWS and I expect 100% availability, mostly because my customers are everywhere in the world and there's no available window for downtime.
If you have this 100% availability expectation, you're going to have to face some realities. DBMS versions fall out of support, so you will have to upgrade or AWS will force-upgrade you their way. The AWS-provided default mechanism has significant downtime that depends on DB size (in order to maintain consistency, and you really don't want to lose that). The only alternative is to go through the pain of sifting through your database estate and logically replicating table by table with verification, as shown in this article, taking particular care with large tables and reindexing. You can't avoid that if you have the (IMO mostly unreasonable) expectation of 100% availability. Change the wheel mid-journey or take a pitstop.
The article is entirely about tooling for safely changing wheels mid-journey. In that context, it's not weird to expect the database to remain available during updates.

Yes, it will require more work and time than just taking the database down and performing the update while it is offline. But as long as the database remains available it doesn't really matter if the update takes 5 minutes or 2 days, only that you can apply updates faster than new ones are released. Since DBMS updates happen at most every few months, that should hopefully not be a problem.

At one of my previous workplaces we had a multi-TB table that could take several days to migrate with the online tooling, and would take 12+ hours to migrate even offline. Nobody wanted to take 12+ hours of downtime (for a busy customer-oriented website) but as long as the db stayed up nobody much cared how long it took.

>I'm a customer of AWS and I expect 100% availability,

AWS neither provides nor promises 100% availability. AWS will have SLAs on various services with the penalty only being a discount on your bill.

It's _your_ job to make your service resilient to a point where you are comfortable with your mitigations.
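As a rough illustration of what such mitigations buy you, here is a back-of-the-envelope sketch of how redundancy changes the numbers. It assumes replica failures are independent and that a single healthy replica is enough to serve traffic. Neither assumption holds perfectly in practice, and the function name is made up for the example.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Availability of a group in which any one healthy replica can serve requests.

    Hypothetical helper; assumes replicas fail independently of each other.
    """
    if replicas < 1:
        raise ValueError("need at least one replica")
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # The group is down only when every replica is down at the same time.
    return 1.0 - (1.0 - replica_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each -> {combined_availability(0.99, n):.6f}")
```

Under those assumptions each extra replica adds nines but never reaches 1.0, and correlated failures (a bad deploy, a regional outage) make the real figure lower.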

But you don't expect 100% availability from every server for every service in every region, do you?
>I'm a customer of AWS and I expect 100% availability

Well, you aren't gonna get it. It's a myth, like "5 nines" and such are, based on the assumption that businesses can foresee the unforeseen and plan ahead.

Whether a service is distributed or not, at some point some issue will come up and availability is going to stop for a while.

If you expect 100% and you’re making business decisions based on that expectation, I encourage you to increase your sophistication about reliability before it costs you a great deal of money.