| > I've seen many multi-datacenter self-managed deployments provide better uptime than Amazon web services.
Self-managed, multi-DC? Congrats on having a lot of money to blow, I guess. Yes, with enough money you can match Amazon for uptime or scalability or whatever metric you prefer. For the same money you can probably buy triple the capacity in Amazon or your preferred cloud provider, so this is mostly a game for people with really deep pockets, really large scale, or really poor budgeting.
> You are forgetting that when you own the hardware, you can actually orchestrate maintenance windows with live migrations, etc and then take down an entire datacenter with no impact.
How many DCs are you talking about here? Are you self-managing in 4+ DCs? Or are you running in 2 DCs with capacity overbuilt by 100+%? In either case, deep pockets are nice to have. Also, does your maintenance strategy seriously involve bringing down entire DCs? This is kind of absurd, and it makes half of me jealous of the bathtub full of cash you must bathe in. It makes the other half of me question some engineering decisions you've apparently made.
> all of S3 is down during critical business hours.
I have trouble believing people when they claim to do significantly better than Amazon (or another favorite cloud provider) for infrastructure uptime. If you stand up a fairly complex system composed of a number of loosely coupled services, you're going to end up experiencing some outages, because you'll face the same challenges as Amazon, and those guys aren't idiots. You'll lose your message queue due to a bug, or you'll lose a network switch and realize your failover takes 30 minutes to complete instead of the 5 seconds you hoped for, or you'll accidentally DDoS a subsystem when exercising a failover or a system upgrade, or something else.
Complex systems fail, and when people tell me they built an "internet scale" system with better uptime than Amazon, I'm left to assume that they either do a bad job of tracking uptime or that their systems are not at the scale they imagine. Everyone who builds large systems experiences outages. |
That needs a "dollar-for-dollar" qualification, or something to that effect. It's possible, but very expensive.
There are, for instance, long-running experiments (and I mean really long-running: many years or even decades) where any amount of downtime would force a do-over.
One of my customers had something like this on the go. The amount of money they spent on their power and network redundancy was off the scale, but they definitely had better uptime than Amazon.
Their problems were more along the lines of 'this piece of equipment is nearly EOL, how do we replace it without interrupting the work it does?'.