by duxup 2906 days ago
Former network engineer here, can confirm. Time and again I've seen redundant systems create their own problems where without all that extra complexity things would have been fine.

Even ISPs and CDNs I worked with sometimes have surprisingly uncomplicated redundancy systems (sometimes just a handful of small routers they are very much ready to power down to cut over to backup paths or bring up new paths) and often they do not use the more complicated methods.

The catch with complicated redundancy is there is always a very close relationship or protocol or something between redundant components, be it storage systems, network systems, anything. Inevitably a system goes down or loses its mind and takes its redundant peers with it... every new system you introduce is one more piece that could reach out and take everyone else with it. I saw it time and again, and again...

3 comments

I’ve seen overengineered and undermaintained HA systems result in much lower uptimes than a simple system with multiple SPOFs. I’ve seen well built and maintained HA systems fail under “rare” edge cases.

I’ve also seen well built and maintained HA systems work exactly as desired.

As a general rule, the cost of building and operating a reliable HA solution is not 2x, but at least 10x. If the system being protected is not worth that, you’ll very likely find the MTTR acronym far easier to catch than the rather more slippery HA.
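A minimal sketch of why chasing MTTR can pay off, assuming the standard steady-state formula availability = MTBF / (MTBF + MTTR). The function names and any numbers you feed in are illustrative assumptions, not figures from this thread:

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime per year for the given MTBF and MTTR."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical simple system: same failure rate, two different repair times.
    for mttr in (10.0, 1.0):
        print(mttr, availability(1000.0, mttr), downtime_hours_per_year(1000.0, mttr))
```

With the failure rate held fixed, cutting repair time by 10x cuts expected downtime by roughly the same factor, without adding any redundancy glue.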

Completely agree.

My home network is built with Mikrotik kit which is priced where it's affordable to have spares. I have yet to encounter a failure, but could drop in a new router in a couple of minutes with the saved configs.

I have SNMP monitoring feeding from telegraf into influxdb on an RPI. Dashboard rendered with Grafana on PC. Also have telegraf pinging to all 24x7 devices and collecting data from electricity meter, smartplugs, and Nests. It's been fun to do.

What advantage does that offer over something like LibreNMS, which will do everything?
Would you consider doing a write-up of how you set this up?
Then you're not building your redundant systems properly.

Web, Power, Internet, Network, Military systems at scale use reliable redundancy and work w/ very little downtime.

The key part of redundancy is that your "redundancy glue"[1] must be significantly more reliable than each component, including its software and implementation -- because often the glue failing in isolation can itself cause outages. Without redundancy, the probability of failure is simply P(single failure); for 2x parallel redundant systems (assuming independent failures) it becomes roughly P(single failure)^2 + P(glue failure). If P(single failure)^2 ~ 0, we need P(glue failure) < P(single failure), at the very least.

[1] i.e. the systems that interconnect the multiple redundant systems, detect failures, redirect traffic, etc.
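The arithmetic above can be sketched as follows. This assumes independent failures and uses the comment's approximation P(single)^2 + P(glue); the exact value under independence would be P(single)^2 + P(glue) - P(single)^2 * P(glue), which is slightly smaller. Function names are hypothetical:

```python
def redundant_pair_failure(p_single: float, p_glue: float) -> float:
    """Approximate failure probability of two parallel components plus glue.

    Both components must fail (p_single squared), or the glue fails on its own.
    """
    return p_single ** 2 + p_glue


def redundancy_helps(p_single: float, p_glue: float) -> bool:
    """True if the redundant setup fails less often than one bare component."""
    return redundant_pair_failure(p_single, p_glue) < p_single


if __name__ == "__main__":
    # Illustrative probabilities only: reliable glue vs. flaky glue.
    print(redundancy_helps(0.01, 0.001))
    print(redundancy_helps(0.01, 0.02))
```

When the glue is less reliable than a single component, the "redundant" setup comes out worse than running one component with no glue at all.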

Very similar to the 'infrastructure as code' story, where you're still left with the construction and maintenance of the infrastructure that bootstraps the infrastructure as code systems.

Turtles all the way down, I guess.

> Turtles all the way down, I guess.

Indeed, in this case it is of course important that this does not happen :) To see the increased reliability and P(glue failure) < P(single failure), you need to ensure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Another adequate expression to apply here is

"Who watches the watchmen?"

The answer again is the watchmen must watch themselves and be very reliable.

On this topic I recommend the book "The Computer and the Brain" by von Neumann (the brilliant mathematician), where he explores how computing systems can be reliably interconnected and how those failure probabilities interact. He was interested in how the brain could be so robust to failure -- don't worry, there's no time spent speculating on how the brain works; instead he derives from first principles the properties of reliable computing components and possible reliable designs (the brain's internal workings, unknown at the time and now to a lesser extent, would follow as a special case). He used this same approach in analyzing the principles of life, where he came up with a self-replicating machine with a tape encoding of itself, predating the discovery of DNA's structure -- it's a very inspiring and powerful approach. Unfortunately he could not complete "The Computer and the Brain"; he was in declining health due to cancer and died while writing it. What was left is still very interesting imo. He is one of those giants whose shoulders we can sit on to peek over the horizon :)

Thank you.

As a caution against tenanting the deployment tools in-band, I'm reminded of an incident I witnessed about five years back. Company was moving their compute from on-prem to colo datacenters. Pretty good, mature setup: almost entirely virtualized, 10Gb iSCSI SAN, credentials managed via a dedicated COTS tool, etc. They got most things over-the-wire to the DC. But the final migration had to be done cold: shut down the last bits that were keeping everything running, move them to the DC and power back on.

Everything went very well until the SAN wouldn't come up. To get into the SAN and troubleshoot they needed the domain, which wasn't available. They had a local account on the SAN; the key for that was safely stored in the password manager. Which was a virtual machine. On the hypervisors. That wouldn't come up until the SAN was booted. Oops!
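A circular cold-start dependency like this one can be caught ahead of time by writing the boot dependencies down as a graph and checking it for cycles. A minimal sketch using depth-first search; the node names and edges are a hypothetical model of the incident above, not a real inventory:

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if acyclic.

    deps maps each item to the things it needs before it can come up.
    """
    in_progress, done = set(), set()
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        in_progress.add(node)
        path.append(node)
        for dep in deps.get(node, []):
            if dep in in_progress:
                # Back edge: we reached a node that is still on the current path.
                return path[path.index(dep):] + [dep]
            if dep not in done:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        in_progress.discard(node)
        done.add(node)
        return None

    for node in deps:
        if node not in done:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical model: what each piece needs during a cold start
# when the SAN requires hands-on troubleshooting.
cold_start = {
    "san": ["san_local_login"],
    "san_local_login": ["password_manager"],
    "password_manager": ["hypervisor"],
    "hypervisor": ["san"],
}

if __name__ == "__main__":
    print(find_cycle(cold_start))
```

Any cycle it reports is a place where at least one credential or tool needs an out-of-band copy.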

OK, that's a very obvious foot-in-mouth, in hindsight. As a more likely example, how about the Amazon S3 outage a few years back that wasn't reported on the status page, because the images for the status page were stored on... S3 :D

> you need to ensure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Absolutely agree.

Certainly it's possible to build redundant systems properly. But it's expensive. All the well-built redundant systems you listed understand that and budget for it.

Most half-baked redundant systems I've seen are a result of "I want four nines, but I only want it to cost 20% more than a two or three nines solution" type thinking.