
by dijit 1504 days ago

I think it can be unfair to characterise single-zone failures as a failure to adequately deploy or architect.

There are many opportunities for failure even if only a single zone goes away. Most (if not nearly all) database solutions elect leaders, for example, and "brown-outs" (as in, not total failures) can lead to a degraded leader maintaining leadership status, or at least messing with quorum.
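To make the brown-out case concrete, here is a minimal sketch assuming a simplified failure detector in which followers only start an election after a fixed number of consecutive missed heartbeats. The function name, the tick model, and the timeout value are all hypothetical and not taken from any particular database.

```python
def ticks_until_failover(heartbeat_ok, election_timeout):
    """Return the tick at which followers would start an election, or None.

    heartbeat_ok: one boolean per tick, True if the leader's heartbeat arrived.
    election_timeout: consecutive missed heartbeats tolerated before an election
    (a hypothetical parameter for this toy model).
    """
    missed = 0
    for tick, ok in enumerate(heartbeat_ok):
        # Any heartbeat that gets through resets the followers' patience.
        missed = 0 if ok else missed + 1
        if missed >= election_timeout:
            return tick
    return None


# Total failure: no heartbeat ever arrives, so an election starts quickly.
total_failure = [False] * 20

# Brown-out: only every third heartbeat arrives, which is just enough
# to keep resetting the counter.
brown_out = [tick % 3 == 0 for tick in range(20)]

hard = ticks_until_failover(total_failure, election_timeout=3)
soft = ticks_until_failover(brown_out, election_timeout=3)

print("total failure -> election at tick", hard)
print("brown-out     -> election at tick", soft)
```

In this toy model the dead leader is replaced after three ticks, while the degraded one is never replaced even though it is missing two out of every three heartbeats.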

Other situations can exist where migration out of a zone leaves hardware unavailable for other people to consume. After all, the cloud is not magic: if people's workloads auto-shift to the surrounding (unaffected) zones, all the free hardware could be used up, which impacts other people's ability to do the same migration.
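As rough arithmetic for the capacity point, here is a sketch assuming equally sized zones, evenly spread load, and everything from the lost zone moving to the survivors. The function names are made up for illustration, and real capacity is not this uniform.

```python
def utilization_after_zone_loss(zones, utilization):
    """Utilization of the surviving zones once one zone's load is redistributed.

    Assumes equally sized zones and evenly spread load (a simplification).
    """
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    return utilization * zones / (zones - 1)


def max_safe_utilization(zones):
    """Highest pre-failure utilization at which the survivors can absorb one zone."""
    if zones < 2:
        raise ValueError("need at least two zones to survive losing one")
    return (zones - 1) / zones


for before in (0.6, 0.7):
    after = utilization_after_zone_loss(3, before)
    verdict = "fits" if after <= 1.0 else "does not fit"
    print(f"3 zones at {before:.0%} -> survivors at {after:.0%} ({verdict})")
```

With three zones, anything above roughly 67% utilization before the failure leaves no room for everyone to move at once, regardless of how well any individual customer architected their own deployment.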

Honestly, I can think of dozens of examples where, even if you had built everything multi-zonal, you could be down due to a single zone: for instance, if some unknown subsystem was zonal (like IAM?), or if you use regionally available persistent disks and they suddenly perform extremely badly on writes because they can't sync to the unavailable datacenter.

I believe multi-zone is less possible than we would like it to be; there are many cases where you can commit no error but still be completely at the mercy of a single zone going away.

3 comments

scottlamb 1504 days ago

> I believe multi-zone is less possible than we would like it to be; there are many cases where you can commit no error but still be completely at the mercy of a single zone going away.

There are many understandable ways to accidentally have a single point of failure. But if your conclusion after the outage is that there was no mistake, you have made two of them, and the second is much less understandable.


samstave 1504 days ago

I remember when it was the control plane at AWS US-West that went out, causing mass havoc for many regardless of their architecture.


throwaway787544 1504 days ago

Yeah, but it looks bad.
