How is it any harder to do in the cloud than on a rack in a warehouse? At least you don't have to muck about with cables and phoning power companies up.
Just an example: during the incident, even people serving 10 ops/sec, but very important 10 ops/sec, were affected by a huge overall load that was mostly not theirs. It's true that when you "go cloud" you don't have to manage your own operations, but you are basically putting everything in the hands of other ops people, and what happens to you depends on a much wider set of conditions.
So managing your own stuff is hard, but you are in control and can do things in a way you believe is completely safe for you. Or you may still hit the same kinds of events sometimes, but perhaps while paying a lot less for the same services. Or you can build a deployment with characteristics that are often impossible to get cost-effectively in the cloud (a lot of RAM per server is an example).
It's not stupid to use AWS services, but it's not stupid to manage your own operations either, whether on your own hardware or at least on the bare metal and/or virtual machines certain providers offer, while still being partly accountable, responsible, and in control of your system software deployment and operations.
Yeah, but it's a very rare in-house team that can keep stuff up better than an AWS or a gcloud. I'd only consider it if I was doing really, really REALLY specialised stuff that absolutely could not be handed off to some mega-host.
I used to do infrastructure on physical hardware, and we'd go years without an outage sometimes (generators in the datacenter, diesel fuel contracts, redundant fiber providers using BGP). Doing it in the cloud is harder, because you're at the mercy of the provider when things go south, and you have no transparency into why it went wrong except what they're willing to publish. Why did it happen? Will it happen again?
I mean, you can argue that the cloud is better. But how often are Heroku and AWS down? About as often as physical providers (I concede S3 is pretty solid though).
You call up IBM. You ask for a mainframe solution for two sites. You get experts to set it up for you with your application and such. You don't worry about downtime again for at least 30 years.
You call up Bull, Fujitsu, or Unisys for the same thing.
You call up HP. You ask for a NonStop solution. You get the same thing for at least 20 years.
You call up VMS Software. You ask for an OpenVMS cluster. You get the same thing for at least 17 years.
Well-designed OSes, software, and hardware did cloud-style stuff for a long time before cloud existed, without the downtime. Cloud certainly brought price down and flexibility up. Yet these clouds still haven't matched '70s-'80s technology in uptime despite all the brains and money thrown at them. That's a fact.
So they shouldn't be used for anything mission critical where downtime costs lots of money.
This is absolute cobblers. I worked in an IT team that had a pair of IBM mainframes that were fed and watered at crushing expense and for which even the tiniest software change required a colossal waterfall project.
One day, they failed. One went offline - for reasons never revealed, at least to me - and the secondary didn't come up. Radio silence, kaput. And an airline that housed mainframes in the same DCs had their booking system fail at exactly the same time (with national headlines to match).
The myth of mainframe uptime is exactly that. La-la-land for hardware & services salesmen.
Appreciate the counterpoint. Could in fact be a myth or legend. Lots of money to justify spreading disinformation, too. Maybe an anonymous survey by a reputable organization is in order that tries to break down what issues people have and don't have along with specific metrics. Then compile that into a big picture.
Meanwhile, the companies I've worked at all had mainframes without the trouble people describe. Problems were virtually always the app developers or the pain of doing 21st century stuff with '60s-'80s architecture or legacy code.
I've seen a NonStop solution fail for the completely mundane reason of insufficient disk space after a burst of transactions. One condition for those 30-year uptimes is also a 100% predictable environment.
Never heard of that one. Funny stuff. Shows even top tier can be improved.
Edit to add: mainframes also run user and server type workloads. Some of those are predictable, some aren't. Bull's machines virtualize whole desktops. The mainframe as a whole, especially its important services, is usually still available despite issues with those. For instance, my company splits stuff between critical on mainframe or AS/400's plus non-critical on whatever is useful ("best-of-breed" they say...). The critical stuff is either on the IBM stuff or leverages it in a client-server setup. Those apps either always work or (rarely) they fail-safe in an obvious way that does no damage. Nobody I work with can remember those systems going down over the 10 years they've worked there. The other stuff regularly has issues across the board. The key difference is effective architecture and how it's implemented.
Caveat here. Sure, a mainframe/high end hardware & supervisor OS can run for 30 years... but the actual applications that users are facing... no, they cannot. You need to upgrade DB2 or IMS or whatever Java app you are using? There will be downtime.
Depends on the design. You have to plan for that stuff ahead of time. I'm not going to claim that's easy. It's just very helpful and there are companies that specialize in helping with it. The most common method was decomposing the app while running it on a cluster so parts of the app or nodes can be taken down. There are strategies for mainframes, too, but my experience was clusters.
Basic strategy was putting something in front of them that can redirect to the new system upon a trigger. Let's assume it's functionality + tons of data. The new system first gets the data moved to it in batch form for efficiency reasons. Once it catches up enough, it starts syncing in a more online fashion until it gets to the point that it's syncing in real-time. All kinds of tests are performed on that system throughout this process. Eventually, a change-over happens that should be barely perceptible. The inability to do this is usually due to fragile architecture or tightly-coupled implementations, which are unfortunately all too common in enterprises.
Note: It can also help if your app was written in something like Common LISP or Erlang that supports live updates. That with the delta approach (version A->A/B->B) equals upgrades with no downtime. ;) Combining it with the clustering approach is quite powerful, but the clustering approach is more applicable to the tools the majority uses.
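To make the batch-then-sync-then-redirect idea concrete, here's a toy, single-process sketch. Everything in it is made up for illustration: the `Frontend` class and its method names are hypothetical, plain dicts stand in for the old and new systems, and deletes and concurrency are ignored entirely. It only shows the order of the phases, not how a real migration tool does it.

```python
class Frontend:
    """Hypothetical redirector sitting in front of the old and new systems."""

    def __init__(self, old_store, new_store):
        self.old = old_store
        self.new = new_store
        self.active = old_store   # all traffic starts on the old system
        self.pending = []         # writes the new system has not seen yet

    def write(self, key, value):
        self.active[key] = value
        if self.active is self.old:
            # Remember the write so it can be replayed onto the new system.
            self.pending.append((key, value))

    def read(self, key):
        return self.active[key]

    def batch_copy(self):
        # Phase 1: bulk-load everything the old system currently holds.
        self.pending.clear()
        self.new.update(self.old)

    def sync(self):
        # Phase 2: replay writes that arrived since the last copy or sync.
        while self.pending:
            key, value = self.pending.pop(0)
            self.new[key] = value

    def cut_over(self):
        # Phase 3: final sync, verify, then redirect traffic on the trigger.
        self.sync()
        if self.new != self.old:
            raise RuntimeError("new system out of sync; staying on the old one")
        self.active = self.new


old, new = {"a": 1, "b": 2}, {}
fe = Frontend(old, new)
fe.batch_copy()       # bulk move
fe.write("c", 3)      # traffic keeps flowing to the old system
fe.sync()             # new system catches up
fe.write("a", 10)     # one more write just before the switch
fe.cut_over()         # barely perceptible change-over
fe.write("d", 4)      # now lands on the new system only
```

After the cut-over the old store no longer receives writes, so it can be taken down for the upgrade. In a real setup the verification step would be the "all kinds of tests" part, not a simple equality check.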
Given they had ~300 minutes of outage in 3 years, you're looking at ~99.98% reliable in just that region. That's pretty good for a stateful serving system, and indeed you'd be pushed to do better.
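For anyone who wants to check the arithmetic, a quick sketch, assuming 365-day years and treating the ~300 minutes as total downtime over the full 3 years (the function name is just made up here):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap days


def availability(outage_minutes, years):
    """Fraction of time the service was up over the given period."""
    total_minutes = years * MINUTES_PER_YEAR
    return 1 - outage_minutes / total_minutes


pct = availability(300, 3) * 100
print(f"{pct:.2f}%")  # prints 99.98%
```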