| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by darkmighty 2906 days ago
	The key part of redundancy is that your "redundancy glue"[1] must be significantly more reliable than each component, including its software and implementation -- because often the glue failing in isolation itself can cause outages. So the probability of failure was simply P(single failure); now for 2x parallel redundant systems it is P(single failure)^2 + P(glue failure). If P(single failure)^2 ~ 0, we need P(glue failure) < P(single failure), at the very least. [1] i.e. the systems that interconnect the multiple redundant system, detect failures, redirect traffic, etc.

1 comments

FooHentai 2905 days ago

Very similar to the 'infrastructure as code' story, where you're still left with the construction and maintenance of the infrastructure that bootstraps the infrastructure as code systems.

Turtles all the way down, I guess.

link

darkmighty 2904 days ago

> Turtles all the way down, I guess.

Indeed it is important in this case of course that this does not happen :) To see the increased reliability and P(glue failure)<P(single failure) you need to assure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Another adequate expression to apply here is

"Who watches the watchmen?"

The answer again is the watchmen must watch themselves and be very reliable.

On this topic I recommend von Neumann's (the brilliant mathematician) "Computer and the brain" book, where he explores how computing systems can be reliably interconnected and how those failure probabilities interact. He was interested on how the brain could be so robust to failure -- don't worry there's no time spent speculating on how the brain works, instead he derives from first principles properties of reliable computing components, and possible reliable designs (the brain's unknown internal workings at the time, and now to a lesser extent, would follow as a special case). He used this same approach in analyzing the principles of life, where he came up with a self-replicating machine with a tape encoding of itself, predating the discovery of DNA -- it's a very inspiring and powerful approach. Unfortunately he could not complete 'Computer and the Brain', he was in declining health due to cancer and died while writing it. What was left is still very interesting imo. He is one of those giants whose shoulders we can sit on to peek over the horizon :)

link

FooHentai 2902 days ago

Thank you.

As a caution against tenanting the deployment tools in-band, I'm reminded of an incident I witnessed about five years back. Company was moving their compute from on-prem to colo datacenters. Pretty good, mature setup: Almost entirely virtualized, 10Gb iSCSI SAN, credentials managed via a dedicated COTS tool, etc. They got most things over-the-wire to the DC. But the final migration had to be done cold - Shut the last bits down that were keeping everything running, move them to the DC and power back on.

Everything went very well until the SAN wouldn't come up. To get into the SAN and troubleshoot they needed the domain, which wasn't available. They had a local account on the SAN, the key for that was safely stored in the password manager. Which was a virtual machine. On the hyper visors. That wouldn't come up until the SAN was booted. Oops!

OK, that's a very obvious foot-in-mouth, in hindsight. As a more likely example, how about the Amazon S3 outage a few years back that wasn't reported on the status page, because the images for the status page were stored on... S3 :D

>you need to assure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Absolutely agree.

link