Hacker News new | ask | show | jobs
by freehunter 2033 days ago
>you can’t trust one provider at all

The hard part with multi-cloud is, you're just increasing your risk of being impacted by someone's failure. Sure if you're all-in on AWS and AWS goes down, you're all-out. But if you're on [AWS, GCP] and GCP goes down, you're down anyway. Even though AWS is up, you're down because Google went down. And if you're on [AWS, GCP, Azure] and Azure goes down, it doesn't matter than AWS and GCP are up... you're down because Azure is down. The only way around that is architecting your business to run with only one of those vendors, which means you're paying 3x more than you need to 99.99999% of the time.

The probability that one of [AWS, Azure, GCP] is down is way higher than the probability that just one of them is down. And the probability that your two cages in your datacenter is down is way higher than the probability that any one of the hyperscalers is down.

2 comments

> which means you're paying 3x more than you need to 99.99999% of the time.

This would be a poor decision. If you assume AWS, GCP, and Azure would fail independently, you can pay 1.5x. Each of the 3 services would be scaled to take 50% of your traffic. If any one fails, you would then still be able to handle 100%. This is a common way to structure applications. Assuming independence means that more replicas result in less overprovisioning. 1 replica means needing to provision 2x. Having 5 independent replicas means, you need to provision 1.25x to be resilient against one failure as each replica will be scaled at 25%.

In general, N replicas need N/(N-1) over provisioning to be resilient against one replica failing.

I disagree. It’s about mitigating the risk of a single provider’s failure. Single providers go down all the time. We’ve seen it from all three major cloud vendors.
You disagree with what? That relying on three vendors increases your risk of being impacted by one? That's just statistics. You can disagree with it, but that doesn't make it incorrect.

Or do you disagree that planning for a total failure of one and running redundant workloads on other vendors increases your costs 99.99999% of the time? Because that's a fairly standard SLA from each of the major vendors. Let's even reduce it to EC2's SLA, 99.99%. So 99.99% of the time you're paying 3x as much as you need to be paying just to maintain your services an extra four hours per year. Again, you can disagree with that but that doesn't make it incorrect.

Some businesses might need that extra four hours, the cost of the extra services might be cheaper than the cost of four hours of downtime per year. But you're not going to find many businesses like that. Either you're running completely redundant workloads, paying 3x as much for an extra 4 hours per year, or you're going to be taken offline when any one of the three go down independently of each other.

Single providers go down, yes. And three providers go down three times as often as one. Either you're massively overspending or you're tripling your risk of downtime. If multi-cloud worked, you'd be hearing people talking about it and their success stories would fill the front page of Hacker News. They don't, because it doesn't.

How do you test failover from provider-wide outages?

I’ve never heard of an untested failover mechanism that worked. Most places are afraid to invoke such a thing, even during a major outage.

That’s fairly simple. Regular scenario planning, drills and suitable chaos engineering tooling.

Being afraid of failures is a massive sign of problems. I’ve worked in those sorts of places before.