| HN Mirror


	by tcarn 1915 days ago
	Phew, it's not just me. Beyond me why more of these big tech companies can't build more redundancies for their services.

3 comments

cranekam 1915 days ago

These companies do build for redundancy. Not in the sense of "we need 10 servers, let's have 10 more for redundancy" (that gets expensive when you have millions of hosts) but by scaling out across multiple regions/DCs/clusters/etc. such that there is enough slack in the system to absorb the failure of 1 or 2 resource units (DCs, fiber, whatever).
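As a rough illustration of the arithmetic behind that kind of slack, here is a minimal sketch. The function names and the numbers are hypothetical, chosen only for the example, and it assumes load can be spread evenly across identical units, which real systems only approximate.

```python
import math


def units_needed(peak_load: float, unit_capacity: float, tolerated_failures: int) -> int:
    """Units to provision so the survivors still carry peak load
    after `tolerated_failures` units are lost (hypothetical helper)."""
    base = math.ceil(peak_load / unit_capacity)
    return base + tolerated_failures


def overhead(peak_load: float, unit_capacity: float, tolerated_failures: int) -> float:
    """Extra capacity as a fraction of the bare minimum needed for peak load."""
    base = math.ceil(peak_load / unit_capacity)
    return tolerated_failures / base


def max_safe_utilization(total_units: int, tolerated_failures: int) -> float:
    """Highest average utilization per unit that still leaves room
    to absorb the loss of `tolerated_failures` units."""
    return (total_units - tolerated_failures) / total_units


# Made-up numbers: peak load of 10, each unit (DC, cluster, ...) handles 1.
total = units_needed(10, 1, 2)       # 12 units instead of 20 for full duplication
extra = overhead(10, 1, 2)           # 0.2, i.e. 20% extra rather than 100%
ceiling = max_safe_utilization(total, 2)
```

Under these assumptions, surviving two failed units costs about 20% extra capacity instead of the 100% that duplicating everything would cost, as long as each unit normally runs at or below `ceiling` (roughly 83%).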

Also, widespread outages like this are seldom the result of insufficient capacity. They are almost always a perfect storm of several failures within systems that are individually built to handle adverse conditions. An example might be a bug within a task scheduling system that inadvertently scales down some critical service, which in turn leads to something else failing to reach consensus or read configuration or who knows what. The point is that each of these components is designed and built to handle failure, but sometimes the holes in the cheese line up and the whole thing fails.

In this case, since IG, WA and FB were all affected, it's reasonable to guess the failure was in some shared component like load balancing or task placement, though (as hinted at above) the origin of the fault is not necessarily in that component directly.

WanderPanda 1915 days ago

IIRC there was a situation where Google built a circular dependency into their own cloud services, and they had massive issues bootstrapping the systems again after they went haywire.

sodality2 1915 days ago

The scale of Facebook is beyond what you probably imagine it to be.

justapassenger 1915 days ago

It's not just about adding redundancies. Redundancies don't protect you against bugs, and they are themselves very complex, so they introduce more opportunities for errors. Even with the best redundancy, you'll have incidents from time to time.

capableweb 1915 days ago

It doesn't help either that Facebook made three separate services that should have nothing to do with each other, talk to each other and all route their traffic through the same infrastructure.

What they did was purposefully remove redundancy they had, in order to be able to track people more (in order to profit more) and possibly scale more easily. Doing nothing would have been easier, yet they still did it.

justapassenger 1915 days ago

If you think that maintaining 3 completely separate stacks for 3 different services within the same company makes all of them more reliable, I don't think you've ever worked on large-scale services.

capableweb 1915 days ago

No, I'm hinting that having one company run these three services in the first place is wrong. They should be three independent companies, as they are really three different services, but lord knows governments do nothing to prevent monopolies these days.

bananaface 1915 days ago

Facebook took $85 billion in revenue last year and has thousands of developers on the payroll. They could absolutely maintain 3 separate services.

Bear in mind WhatsApp used to maintain a userbase of 200 million with 50 total employees (not even just developers).

capableweb 1915 days ago

> They can absolutely maintain 3 different services

Lol, I'd have believed you back when they purchased the service, but the countless times it has gone down since then obviously prove that they cannot even maintain the services when they are folded together on the same infrastructure.

It was a long time ago that Facebook employed the best of the best. Seems like it's mostly average developers and infrastructure people there now, just trying to hold up the house of cards they built.

bananaface 1915 days ago

Hah, to be clear, that is what I meant - the idea that Facebook doesn't have the resources is ludicrous; the problem runs deeper.
