HN Mirror


by cranekam 1916 days ago

These companies build for redundancy. Not in the sense of "we need 10 servers, let's have 10 more for redundancy" (that gets expensive when you have millions of hosts), but by scaling out across multiple regions/DCs/clusters/etc. so that there is enough slack in the system to absorb the failure of 1 or 2 resource units (DCs, fiber, whatever).
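To put rough numbers on the difference between doubling everything and keeping a little slack, here is a minimal sketch. It assumes equal-sized resource units and a single peak-load figure; the function names and the numbers are made up for illustration and are not taken from any real capacity planner.

```python
import math


def units_needed(peak_load: float, unit_capacity: float, tolerated_failures: int) -> int:
    """Units required so peak load still fits after losing `tolerated_failures` units."""
    base_units = math.ceil(peak_load / unit_capacity)  # units needed with no failures
    return base_units + tolerated_failures


def spare_overhead(peak_load: float, unit_capacity: float, tolerated_failures: int) -> float:
    """Spare capacity as a fraction of the capacity needed with no failures."""
    base_units = math.ceil(peak_load / unit_capacity)
    return tolerated_failures / base_units


def survives(n_units: int, unit_capacity: float, peak_load: float, failed: int) -> bool:
    """True if the remaining units can still carry the peak load."""
    return (n_units - failed) * unit_capacity >= peak_load


if __name__ == "__main__":
    # Hypothetical: peak load of 1000 requests/s, each unit handles 100 requests/s.
    print(units_needed(1000, 100, tolerated_failures=2))    # 12 units
    print(spare_overhead(1000, 100, tolerated_failures=2))  # 0.2, i.e. 20% extra
    print(survives(12, 100, 1000, failed=2))                # True
    print(survives(12, 100, 1000, failed=3))                # False
```

Under these assumptions, tolerating two failed units out of ten costs 20% extra capacity, whereas "10 more for redundancy" costs 100%.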

Also, widespread outages like this are seldom the result of insufficient capacity. They are almost always a perfect storm of several failures within systems that are individually built to handle adverse conditions. An example might be a bug within a task scheduling system that inadvertently scales down some critical service, which in turn leads to something else failing to reach consensus or read configuration or who knows what. The point is that each of these components is designed and built to handle failure, but sometimes the holes in the cheese line up and the whole thing fails.

In this case, since IG, WA and FB were all affected, it's reasonable to guess the failure was in some shared component like load balancing or task placement, though (as hinted at above) the origin of the fault is not necessarily in that component directly.

1 comment

WanderPanda 1916 days ago

IIRC there was a situation where Google built a circular dependency into their own cloud services, and they had massive issues bootstrapping the systems again after they went haywire.
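For what it's worth, this kind of loop is easy to check for once the dependencies are written down as a graph. Below is a minimal sketch using a depth-first search. The service names are hypothetical and have nothing to do with Google's actual systems, and `find_cycle` is just an illustrative helper, not anyone's real tooling.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none.

    `deps` maps each service to the services it needs in order to start.
    """
    IN_PROGRESS, DONE = 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []  # services on the current DFS path

    def visit(service: str) -> Optional[List[str]]:
        state[service] = IN_PROGRESS
        path.append(service)
        for dep in deps.get(service, []):
            if state.get(dep) == IN_PROGRESS:
                # dep is already on the current path, so we have looped back to it
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[service] = DONE
        return None

    for service in deps:
        if service not in state:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical services: each one needs the next to boot, and the last needs the first.
    services = {
        "auth": ["config"],
        "config": ["storage"],
        "storage": ["auth"],
    }
    print(find_cycle(services))  # ['auth', 'config', 'storage', 'auth']
```

A cycle like this is harmless while everything is up. It only bites during a cold start, when none of the services can come up first.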
