| I feel like this is a pretty short-sighted solution. > you either need a highly competent team of domain experts, which also incurs a significant cost You can build this team and domain knowledge from the ground up. Internal documentation goes a long, long way too. From working at bare-metal shops off-and-on over the last decade, the documentation capturing why and how are extremely important. > you opt for cloud services where all of this management is taken care of for you Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is when shit hits the fan. When the cloud goes down, everyone is sitting around twiddling their thumbs while customer money flies away. There isn't anything anyone can do. Maybe there will be discussions about how we should make the staging/test envs in more regions ... but that's expensive. When the cloud does come back, you'll be spending more hours possibly rebooting things to get everything back into a good state, maybe even shipping code. When bare-metal goes down, a team of highly competent people, who know it in-and-out are giving you minute-by-minute status updates, predictions on when it will be back, etc. They can even bring certain systems back online in whatever order they want instead of a random order. Thus you can get core services back online pretty quickly, while everything runs degraded. Like all things in software, there are tradeoffs. |
Should you require greater than 99.95% uptime for particularly critical operations, then opting for a multi-region approach, such as using multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year. It has usually been a case of increased error rates or heightened latency due to the inherent redundancy provided by availability zones within each region. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.