by dgroshev
1185 days ago
It seems to me that people tend to vastly overestimate their uptime requirements. A "real-time web application" used by hundreds of millions of people can be down for hours and yet succeed wildly; just look at Twitter, both its old failwhale era and its new, fragile post-Musk state. Complexity, on the other hand, and the lower iteration speed and higher fixed costs that come with it, can kill a business much more easily than a few seconds of downtime here and there. You don't need an "ultra reliable setup" or even a "cluster". You can have one nginx as a load balancer pointing at your unicorn/gunicorn/go thing; it's very unlikely to ever go down. You can run a cronjob with pg_dump and rsync. On the off chance your server dies, irrecoverably corrupting the DB (which is really unlikely for Postgres), chances are your business will survive a fifteen-minute-old database. Most "real-time web applications" are not aerospace, even though we like to pretend that's what we work on. It's an interesting confluence of engineering hubris and managerial FOMO that got us here.
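For illustration, the pg_dump-plus-rsync cronjob could look roughly like the sketch below. This is only one way to do it, under stated assumptions: the database name `app`, the dump directory `/var/backups`, and the rsync target `backup@example.com:/srv/backups/` are all hypothetical placeholders, and the script assumes `pg_dump` and `rsync` are installed and can authenticate without a prompt.

```python
import subprocess
from pathlib import Path


def build_backup_commands(db_name: str, dump_dir: str, remote: str):
    """Return the (pg_dump, rsync) command lines without running them."""
    dump_file = Path(dump_dir) / f"{db_name}.dump"
    # Custom-format dump, restorable with pg_restore.
    pg_dump_cmd = ["pg_dump", "--format=custom", f"--file={dump_file}", db_name]
    # Copy the finished dump to another machine.
    rsync_cmd = ["rsync", "-a", str(dump_file), remote]
    return pg_dump_cmd, rsync_cmd


def run_backup(db_name: str, dump_dir: str, remote: str) -> None:
    """Dump the database, then ship the dump off the box."""
    for cmd in build_backup_commands(db_name, dump_dir, remote):
        # check=True raises if a step fails, so a broken dump is never synced.
        subprocess.run(cmd, check=True)


# Hypothetical usage, e.g. from a script that cron runs every 15 minutes:
#   run_backup("app", "/var/backups", "backup@example.com:/srv/backups/")
```

A crontab schedule of `*/15 * * * *` would match the fifteen-minute window mentioned above; whether that is the right interval depends on how much data loss the business can tolerate.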
That may be true for social media apps where the Terms of Service don't include any SLAs/SLOs, but if you're a SaaS company of any kind, the agreements with clients often include uptime requirements. Their engineers will often consider some form of "x number of nines" an industry standard.
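To put the "nines" in perspective, here is a rough sketch of the arithmetic, assuming "N nines" means an availability of 1 - 10^-N measured over a 365-day year. The function name is made up for illustration, and real contracts may define the measurement window and exclusions differently.

```python
def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget in minutes for an 'N nines' availability target.

    Assumes N nines means availability = 1 - 10**-N over the period,
    e.g. 3 nines = 99.9%.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes / (10 ** nines)


for n in range(2, 6):
    minutes = allowed_downtime_minutes(n)
    print(f"{n} nines: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, three nines leaves roughly 8.8 hours of downtime per year and four nines under an hour, which is where the two views in this thread start to diverge.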