

by manquer 2163 days ago

99%/1% splits, like most failover setups, hardly work smoothly in practice unless you have a lot of money to invest in teams and hardware, run DR drills constantly, and keep standby infrastructure ready to handle full load. It may work in your industry, where the infra cost is trivial compared to the risk and the money being made. In typical SaaS apps, infra is an enormous part of the costs, so keeping a standby ready is not feasible at all.

It is also typical, even in large organizations with the money and the people, that DR drills go the same way as fire drills: everyone knows a drill is coming and reacts accordingly. Chaos Monkey style testing rarely happens.

I would say building resiliency into your architecture is the key to this. Just as having a single customer account for more than 50% of revenue is an enormous risk for any business, relying on any single service provider is also an enormous risk. In manufacturing it is common to insist on a second source for a part; IBM did that to Intel for the PC, which is why AMD got into x86.

In this case proper HA would serve better: a minimum of 2 CDNs, each always carrying 50% of the load and each with the capacity to double if required. If they cannot scale that much, then distribute across 3-4 providers and keep traffic to no more than 25-35% per provider, such that the loss of one means only roughly an additional 10-20% traffic for each of the rest.
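To make the arithmetic concrete, here is a rough sketch. It assumes traffic is split evenly across providers and that the survivors absorb a failed provider's share evenly; the function name `load_after_failure` is made up for illustration.

```python
def load_after_failure(n_providers: int, failed: int = 1) -> tuple[float, float, float]:
    """Return (share_before, share_after, extra) as fractions of total traffic.

    Assumes an even split before the failure and an even split among
    the surviving providers afterwards.
    """
    if n_providers <= failed:
        raise ValueError("no surviving providers")
    before = 1.0 / n_providers
    after = 1.0 / (n_providers - failed)
    return before, after, after - before


if __name__ == "__main__":
    for n in (2, 3, 4):
        before, after, extra = load_after_failure(n)
        print(f"{n} providers: {before:.0%} -> {after:.0%} each "
              f"(+{extra:.1%} of total traffic)")
```

With an even split, 2 providers each have to absorb another 50% of total traffic, 3 providers about 17%, and 4 providers about 8%, which is in the same ballpark as the 10-20% figure above.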

Also, it is important that the two service providers are actually different; if they both depend on the same ISP or backbone to serve an area, it is not going to be effective.
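One way to sanity-check this is to list what each provider depends on upstream and look for overlap. A minimal sketch follows; the provider names and upstream data are entirely hypothetical, and in practice you would have to gather this information from the providers themselves.

```python
from itertools import combinations

# Hypothetical data: upstream networks each provider relies on in one region.
upstreams = {
    "cdn-a": {"backbone-1", "isp-x"},
    "cdn-b": {"backbone-2", "isp-x"},
    "cdn-c": {"backbone-3", "isp-y"},
}


def shared_dependencies(deps):
    """Map each provider pair to the upstreams they have in common."""
    shared = {}
    for a, b in combinations(sorted(deps), 2):
        common = deps[a] & deps[b]
        if common:
            shared[(a, b)] = common
    return shared


if __name__ == "__main__":
    for pair, common in shared_dependencies(upstreams).items():
        print(pair, "share", sorted(common))
```

Any pair that shows up here is not really independent in that region, so an outage at the shared upstream could take both down together.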

The principle should apply across the entire infra: name servers, CDNs, load balancers, storage, compute, DBs, payment gateways, and registrars (use multiple domains, e.g. example.com and example.io, each with its own registrar).