These companies build for redundancy. Not in the sense of "we need 10 servers, let's have 10 more for redundancy" (that gets expensive when you have millions of hosts) but by scaling out across multiple regions/DCs/clusters/etc such that there is enough slack in the system to absorb failure of 1 or 2 resource units (DCs, fiber, whatever).
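To put rough numbers on the "slack" idea, here is a minimal sketch. The function names and the equal-sized-units assumption are mine, not anything these companies publish. It shows why spreading load over many units and tolerating one or two failures is far cheaper than duplicating everything.

```python
def max_safe_utilization(units: int, tolerated_failures: int) -> float:
    """Highest per-unit utilization at which the surviving units can still
    absorb the full load after `tolerated_failures` units are lost.
    Assumes equal-sized units and evenly spread load (a simplification)."""
    if not 0 <= tolerated_failures < units:
        raise ValueError("tolerated_failures must be in [0, units)")
    return (units - tolerated_failures) / units


def extra_capacity_fraction(units: int, tolerated_failures: int) -> float:
    """Capacity bought beyond the bare minimum, as a fraction of that minimum."""
    if not 0 <= tolerated_failures < units:
        raise ValueError("tolerated_failures must be in [0, units)")
    survivors = units - tolerated_failures
    return tolerated_failures / survivors


# 10 regions, survive losing any 2: run each at <= 80%, pay 25% extra.
print(max_safe_utilization(10, 2), extra_capacity_fraction(10, 2))
# 2 sites, survive losing 1 (the "10 more servers" approach): 50%, pay 100% extra.
print(max_safe_utilization(2, 1), extra_capacity_fraction(2, 1))
```

The more units the load is spread over, the smaller the overhead for tolerating the same number of failures.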
Also, widespread outages like this are seldom the result of insufficient capacity. They are almost always a perfect storm of several failures within systems that are individually built to handle adverse conditions. An example might be a bug within a task scheduling system that inadvertently scales down some critical service, which in turn leads to something else failing to reach consensus or read configuration or who knows what. The point is that each of these components is designed and built to handle failure, but sometimes the holes in the cheese line up and the whole thing fails.
In this case since IG, WA and FB were affected it's reasonable to guess the failure was in some shared component like load balancing or task placement, though (as hinted at above) the origin of the fault is not necessarily in that component directly.
IIRC there was a situation where Google built a circular dependency into their own cloud services, and they had massive issues bootstrapping the systems again after they went haywire.
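For what it's worth, a circular dependency is easy to check for if you have the dependency graph written down. Here is a minimal sketch using a depth-first search. The service names are made up, and this is a generic technique, not a description of how Google found or fixed anything.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names (first == last),
    or None if the graph is acyclic. `deps[a]` lists what `a` needs to start."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            seen = state.get(dep, UNSEEN)
            if seen == IN_PROGRESS:
                # dep is already on the current path, so we looped back to it
                return path[path.index(dep):] + [dep]
            if seen == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical services: each one needs the next to boot, so none can start first.
services = {"auth": ["config"], "config": ["storage"], "storage": ["auth"]}
print(find_cycle(services))
```

The hard part in practice is that nobody has the full graph written down, which is exactly how such loops sneak in.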
It's not just about adding redundancies. Redundancies don't protect you against bugs, and they are themselves very complex, so they introduce more opportunities for errors. Even with the best redundancy, you'll have incidents from time to time.
It doesn't help either that Facebook made three separate services, which should have nothing to do with each other, talk to each other and route all their traffic through the same infrastructure.
What they did was purposefully remove redundancy they already had, in order to track people more (and so profit more) and possibly to scale more easily. Doing nothing would have been easier, yet they still did it.
If you think that maintaining 3 completely separate stacks for 3 different services within the same company makes all of them more reliable, I don't think you have ever worked on a large-scale service.
No, I'm hinting that having one company run these three services in the first place is wrong. They should be three independent companies, as they are really three different services, but lord knows governments do nothing to prevent monopolies these days.
> They can absolutely maintain 3 different services
Lol, I'd have believed you back when they purchased the services, but the countless times they have gone down since then obviously prove that they cannot even maintain the services when they are folded together on the same infrastructure.
It was a long time ago that Facebook employed the best of the best. It seems to be mostly average developers and infrastructure people there now, just trying to hold up the house of cards they built.