| So I've been mulling over this stupid thought for a while (with the disclaimer that it's extremely useful for these outage stories to make it to the front page, to help everyone who is getting paged with P1s). But does it really matter? I read people reacting strongly to these outages, suggesting that due diligence wasn't done before relying on a 3rd party for this or that, or that a system engineered to reach anything less than 100% uptime is professional negligence. However, off the top of my head we've had AWS outages, Gmail outages, Azure outages, DNS outages, GitHub outages, and whatever else. All these hugely profitable companies are messing this stuff up constantly. Why are any of us going to do any better, and why does a few hours of downtime ultimately matter? I think my view partly comes from living somewhere where a volcano on the next island over can shut down connections to the outside world for almost a week. Life doesn't have an SLA. Systems should aim for reasonable uptime, but at the end of the day the systems come back online at some point and we all move on. Just catch up on emails or something. I dislike the culture of demanding hyper-perfection, and the idea that we should be prepared to work unhealthy shift patterns to avoid a moment of downtime in UTC-11 or something. My view is increasingly that these outages are healthy, since they force us to confront the fallibility of the systems we build and accept that chaos wins out in the end, even if just for a few hours. |
For example, I'm building a note-taking / knowledge base platform, and we were having some reliability issues last year when our platform and devops process were still a bit nascent. We had a user who was (predictably) using our platform to take notes and study for an exam, which was open book. On the day of her exam our servers went down, and she was justifiably anxious that things wouldn't be back up before her exam was due to start. Luckily I was able to stabilize everything before then and her exam went great in the end, but it might not have happened that way.
Most people on HN would probably point out that this is obviously why your personal notes should always be hosted / backed up locally, but I of course took this as a personal mission to improve our reliability so that our users never have to deal with this again. Since then, I'm proud to say we've maintained 99.99% uptime[1]. So yes, there are definitely many situations where we can and should take a more laid-back approach, but sometimes there are deadlines outside of your control, and having a critical piece of software go offline exactly when you need it can be a terrible experience.
[1] https://status.supernotes.app/
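
For a sense of scale, here is a rough sketch of how much downtime a figure like 99.99% actually allows. The function name and the 365-day and 30-day periods are assumptions chosen for illustration; this is not how the status page above computes its number.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at the given uptime percentage.

    Hypothetical helper for illustration only.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The allowed downtime is the fraction of the period that is NOT covered.
    return period_minutes * (100 - uptime_percent) / 100


if __name__ == "__main__":
    # Assumed periods: a 365-day year and a 30-day month.
    for label, days in (("year", 365), ("30-day month", 30)):
        budget = downtime_budget_minutes(99.99, days)
        print(f"99.99% over a {label}: about {budget:.2f} minutes of downtime")
```

By that arithmetic, 99.99% leaves a little under an hour of downtime per year, so a single three-hour outage would use up more than three years of that budget.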