| Is downtime an actual, business-killing problem in practice? In my experience, very rarely, and clouds also have downtime (over which you have much less control), yet we seem to live with it just fine. > when you need a 3 hours downtime on prod because you need to reboot and reconfigure your services What about when your RDS instance fails and is then stuck on "modifying" for an indefinite period of time (it ended up being 12 hours, and I suspect an AWS engineer eventually did a manual operation to fix it), and you have to restore from a backup and manually rebuild the missing data from other sources such as logs just to get back online in the meantime? I've seen it happen and would've much preferred having the option to SSH in and recover it manually. Scaling is less of a problem when bare-metal is so cheap that you can significantly overprovision and never have to worry about autoscaling. This also means you need far fewer moving parts that can break and take your service down. I'm not saying that the cloud is always bad, but a hybrid approach would be the most pragmatic choice. For raw compute and bandwidth, bare-metal is orders of magnitude cheaper. You can still use the cloud's managed services from your bare-metal servers if you need them, though given how cheap bare-metal is you may realize that you no longer need a lot of them. When it comes to management/sysadmin work, every shop that uses the cloud, beyond very small projects that run fully on a PaaS such as Heroku, has a dedicated DevOps person (or more), which is no different from bare-metal in terms of effort. I'd argue it's more effort than bare-metal, because clouds and their associated services, APIs and tooling (Terraform, etc.) change much more frequently than old-school Linux and hardware. |
Of course it is. If you make a service that other people use as part of their workflows or business, they will switch providers if you’re the only one that routinely goes down.
This is also a slippery slope. If your engineering team is in the habit of shrugging off downtime as no big deal, it tends to get worse and worse as time goes on, staff turns over, systems scale up, and load increases. If you can’t manage to keep downtime to a minimum when you’re small, it’s going to be much worse when you’re bigger.