Hacker News
by brentjanderson 925 days ago
OP here - that's great feedback! Our hope is to build confidence in both the reliability of our product _and_ the consistency of the workloads. Of course, presenting the illusion of consistency while being flaky is far worse than managing customer expectations and taking intentional downtime to achieve better uptime in the long run.

Indeed, setting the expectation of periodic maintenance windows up front probably leads to more robust architectures overall: customers who build in the failsafes they need to tolerate downtime end up with more resilient systems. Teams that can trust their customers in that way can, in turn, take the time to make the investments they need to build a better product.

Perhaps this will be the blog post we write after our next major version upgrade: expectation setting around downtime _is_ the way to very high uptime.

2 comments

Google famously turned off a critical internal service for a minute or so, because they had promised 99.999% (or something like that) uptime, but the service hadn't actually gone down in a few years.

To make sure that (internal) consumers of that service could handle downtime, they introduced some artificially.
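
For a sense of scale, here is a minimal sketch of the arithmetic behind a target like 99.999%. The function name and the 365-day year are assumptions made for illustration; none of this is taken from anything Google has published.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(uptime_target: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed in a period at a given uptime target.

    uptime_target is a fraction, e.g. 0.99999 for "five nines".
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    return (1.0 - uptime_target) * period_minutes


if __name__ == "__main__":
    # Print the yearly budget for three, four, and five nines.
    for target in (0.999, 0.9999, 0.99999):
        budget = downtime_budget_minutes(target)
        print(f"{target:.3%} uptime -> {budget:.2f} min/year")
```

At five nines the budget works out to roughly 5.3 minutes a year, so a deliberate outage of "a minute or so" uses up a noticeable share of it while still staying within the promise.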

Never saw this communicated by Google, but Netflix is the company I have in mind for doing that: https://github.com/Netflix/chaosmonkey

I think it was Chubby they did that with.

Thanks for giving the source!

In my defense, by being so vague, I can't accidentally reveal company secrets. (I used to work as an SRE for Google Photos for a while.)

Yeah, I’d be a lot more confident about this if you talked some about consistency vs. availability and the details of your workload that made you want to choose this trade-off.

I have a potentially weird experience path here: I worked with Galera a bunch early on, because when we asked customers if they wanted HA they said “yes absolutely”, so we sank a ton of time into absolutely never ever going down.

When we finally presented the trade-off space (basically, that occasional 10-minute downtime windows could essentially guarantee that we wouldn’t have data loss), we ended up building a very different product.
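
To put a number on that trade-off, here is a rough sketch of the availability you get from planned windows alone. The comment only says "occasionally", so the window frequency below is a hypothetical input, as are the function name and the 365-day year.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def availability_with_planned_windows(windows_per_year: int,
                                      window_minutes: float = 10.0) -> float:
    """Return the availability fraction if the only downtime is planned windows.

    windows_per_year is hypothetical; the thread does not say how often
    the 10-minute windows would occur.
    """
    planned_downtime = windows_per_year * window_minutes
    if planned_downtime < 0 or planned_downtime > MINUTES_PER_YEAR:
        raise ValueError("planned downtime must fit within one year")
    return 1.0 - planned_downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Example frequencies only: quarterly, monthly, and weekly windows.
    for per_year in (4, 12, 52):
        availability = availability_with_planned_windows(per_year)
        print(f"{per_year:>2} windows/year -> {availability:.4%} available")
```

Even a monthly 10-minute window, 120 minutes a year, leaves availability above 99.97%, which gives a sense of how little uptime was being traded for the data-loss guarantee.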