by outworlder 1429 days ago
> The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry.

Not really.

What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.

Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs with an HA database, you are supposedly covered against failures. Except that when it actually happens, it turns out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churn is now wreaking havoc on the pods that did survive.
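
To put rough numbers on the capacity point, here is a back-of-the-envelope sketch. It assumes load is spread evenly across AZs and lands evenly on the survivors, which real systems rarely manage; the function names and figures are made up for illustration.

```python
def utilization_after_az_loss(az_count: int, utilization: float, lost: int = 1) -> float:
    """Utilization of the surviving AZs after `lost` AZs disappear.

    Assumes (hypothetically) that load was evenly spread before the failure
    and redistributes evenly across the survivors afterwards.
    """
    if not 0 < lost < az_count:
        raise ValueError("must lose at least one AZ and keep at least one")
    return utilization * az_count / (az_count - lost)


def max_safe_utilization(az_count: int, lost: int = 1, ceiling: float = 1.0) -> float:
    """Highest steady-state utilization that stays under `ceiling` after the loss."""
    if not 0 < lost < az_count:
        raise ValueError("must lose at least one AZ and keep at least one")
    return ceiling * (az_count - lost) / az_count


if __name__ == "__main__":
    # Hypothetical fleet: 3 AZs, each running at 70% in steady state.
    after = utilization_after_az_loss(az_count=3, utilization=0.70)
    print(f"utilization after losing one AZ: {after:.0%}")
    print(f"max safe steady-state utilization: {max_safe_utilization(3):.0%}")
```

With 3 AZs running at 70%, the survivors would need to run at roughly 105%, so you are already over capacity before the autoscaler does anything. Under these assumptions you have to stay at or below about two thirds utilization to absorb the loss, and that is before accounting for uneven load, retries, or cold caches.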

Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen, at which point it's too late.


Or, the redundancy actually causes a failure, so not only have you spent more money but you’ve reduced your availability doing so.

(Or worse, the redundancy causes a subtle failure like data loss.)

Nail on the head. The number of times I've seen wildly overcomplicated redundancy setups fail in weird and wonderful ways, causing far more downtime than a simpler setup would have, is pretty silly.

Don’t make the mistake of overromanticizing the simple solutions. They have nice, well-understood failure conditions, and those failures come up relatively frequently.

When you start playing the HA game, the easy failures go off the table, and things break less often because “failures happen constantly and are auto-healed”. But when your virtual IP failover goes sideways or your cluster scheduler starts reaping systems because the metadata service is giving it useless data, you’re well into an infrequent, complex failure, and I hope you have a good ops team.

It’s always a trade-off.