| HN Mirror


by SpaethCo 2906 days ago

Whenever I see solutions like this I think back to an org I worked at where a high-visibility day-long database outage gained upper level management attention. The response, after the managers talked to our vendor (IBM), was to re-architect everything to use HACMP clusters for all of our production databases company-wide.

That was followed by a couple years of 100+ hour/year cumulative outages due to HACMP stability issues, and an environment that everyone was deathly afraid to touch.

The hardcore network engineer in me appreciates the detail in these kinds of solutions, but these days the practical side of me is satisfied with the usability and maintainability of SPOF cable access with a manual failover to a mobile hotspot on the rare occasions it drops offline.

6 comments

duxup 2906 days ago

Former network engineer here, can confirm. Time and again I've seen redundant systems create their own problems where without all that extra complexity things would have been fine.

Even ISPs and CDNs I worked with sometimes have surprisingly uncomplicated redundancy systems (sometimes just a handful of small routers they are very much ready to power down to cut over to backup paths or bring up new paths) and often they do not use the more complicated methods.

The catch with complicated redundancy is that there is always a very close relationship or protocol or something between redundant components, be it storage systems, network systems, anything. Inevitably a system goes down or loses its mind and takes its redundant peers with it... Every new system you introduce is one more piece that could reach out and take everyone else with it. I saw it time and again, and again...

darkr 2906 days ago

I’ve seen overengineered and undermaintained HA systems result in much lower uptimes than a simple system with multiple SPOFs. I’ve seen well built and maintained HA systems fail under “rare” edge cases.

I’ve also seen well built and maintained HA systems work exactly as desired.

As a general rule, the cost of building and operating a reliable HA solution is not 2x, but at least 10x. If the system being protected is not worth that, you’ll very likely find the MTTR acronym far easier to catch than the rather more slippery HA.

MrZipf 2906 days ago

Completely agree.

My home network is built with Mikrotik kit which is priced where it's affordable to have spares. I have yet to encounter a failure, but could drop in a new router in a couple of minutes with the saved configs.

I have SNMP monitoring feeding from telegraf into influxdb on an RPI. Dashboard rendered with Grafana on PC. Also have telegraf pinging to all 24x7 devices and collecting data from electricity meter, smartplugs, and Nests. It's been fun to do.

qmr 2904 days ago

What advantage does that offer over something like LibreNMS, which will do everything?

dpcx 2905 days ago

Would you consider doing a write-up of how you set this up?

late2part 2906 days ago

Then you're not building your redundant systems properly.

Web, Power, Internet, Network, Military systems at scale use reliable redundancy and work w/ very little downtime.

darkmighty 2906 days ago

The key part of redundancy is that your "redundancy glue"[1] must be significantly more reliable than each component, including its software and implementation -- because often the glue failing in isolation itself can cause outages. So the probability of failure was simply P(single failure); now for 2x parallel redundant systems it is P(single failure)^2 + P(glue failure). If P(single failure)^2 ~ 0, we need P(glue failure) < P(single failure), at the very least.

[1] i.e. the systems that interconnect the multiple redundant systems, detect failures, redirect traffic, etc.
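The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration: the function names and the example probabilities are made up, failures of the replicas are assumed to be independent, and the additive formula is the comment's own approximation (it slightly over-counts the case where both the glue and all replicas fail at once).

```python
def redundant_failure_probability(p_single: float, p_glue: float, n: int = 2) -> float:
    """Approximate outage probability for n parallel replicas plus glue.

    Follows the comment's approximation: either all replicas fail together
    (p_single ** n, assuming independent failures) or the glue fails.
    """
    return p_single ** n + p_glue


def redundancy_helps(p_single: float, p_glue: float, n: int = 2) -> bool:
    """True when the redundant setup is less likely to fail than a single system."""
    return redundant_failure_probability(p_single, p_glue, n) < p_single


if __name__ == "__main__":
    p_single = 0.01  # hypothetical: 1% chance that a single system is down
    for p_glue in (0.001, 0.01, 0.05):  # hypothetical glue failure probabilities
        p_total = redundant_failure_probability(p_single, p_glue)
        verdict = redundancy_helps(p_single, p_glue)
        print(f"p_glue={p_glue}: p_total={p_total:.4f}, helps={verdict}")
```

With these made-up numbers, redundancy only pays off when the glue is clearly more reliable than the thing it protects; glue that fails as often as a single system already makes the pair slightly worse than one box.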

FooHentai 2905 days ago

Very similar to the 'infrastructure as code' story, where you're still left with the construction and maintenance of the infrastructure that bootstraps the infrastructure as code systems.

Turtles all the way down, I guess.

darkmighty 2904 days ago

> Turtles all the way down, I guess.

Indeed, it is of course important in this case that this does not happen :) To get the increased reliability and P(glue failure) < P(single failure), you need to ensure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Another adequate expression to apply here is

"Who watches the watchmen?"

The answer again is the watchmen must watch themselves and be very reliable.

On this topic I recommend von Neumann's (the brilliant mathematician) book "The Computer and the Brain", where he explores how computing systems can be reliably interconnected and how those failure probabilities interact. He was interested in how the brain could be so robust to failure -- don't worry, there's no time spent speculating on how the brain works; instead he derives from first principles the properties of reliable computing components, and possible reliable designs (the brain's internal workings, unknown at the time and now to a lesser extent, would follow as a special case). He used this same approach in analyzing the principles of life, where he came up with a self-replicating machine with a tape encoding of itself, predating the discovery of DNA's structure -- it's a very inspiring and powerful approach. Unfortunately he could not complete "The Computer and the Brain"; he was in declining health due to cancer and died while writing it. What was left is still very interesting imo. He is one of those giants whose shoulders we can sit on to peek over the horizon :)

FooHentai 2902 days ago

Thank you.

As a caution against tenanting the deployment tools in-band, I'm reminded of an incident I witnessed about five years back. Company was moving their compute from on-prem to colo datacenters. Pretty good, mature setup: Almost entirely virtualized, 10Gb iSCSI SAN, credentials managed via a dedicated COTS tool, etc. They got most things over-the-wire to the DC. But the final migration had to be done cold - Shut the last bits down that were keeping everything running, move them to the DC and power back on.

Everything went very well until the SAN wouldn't come up. To get into the SAN and troubleshoot they needed the domain, which wasn't available. They had a local account on the SAN, and the key for that was safely stored in the password manager. Which was a virtual machine. On the hypervisors. That wouldn't come up until the SAN was booted. Oops!

OK, that's a very obvious foot-in-mouth, in hindsight. As a more likely example, how about the Amazon S3 outage a few years back that wasn't reported on the status page, because the images for the status page were stored on... S3 :D

> you need to assure the glue systems are very simple and well built -- and preferably they need to be much smaller than the system you're protecting.

Absolutely agree.

evil-olive 2906 days ago

Certainly it's possible to build redundant systems properly. But it's expensive. All the well-built redundant systems you listed understand that and budget for it.

Most half-baked redundant systems I've seen are a result of "I want four nines, but I only want it to cost 20% more than a two or three nines solution" type thinking.
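To put rough numbers on the "nines" mentioned above, here is a small Python sketch that converts an availability target into a yearly downtime budget. The function name is made up for illustration, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (e.g. 3 means 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in (2, 3, 4):
        hours = allowed_downtime_hours(nines)
        print(f"{nines} nines: {hours:.3f} h/year ({hours * 60:.1f} min)")
```

Two nines allows about 87.6 hours of downtime a year, three nines about 8.76 hours, and four nines under an hour, so each extra nine cuts the budget by a factor of ten.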

stevbov 2906 days ago

Reminds me of what my brother-in-law says: I don't want to be stuck doing tech support for my family.

With my luck, it would catastrophically fail while out of town, leaving the wife and kids without internet.

My dad set up a lot of complicated stuff like this. As people are prone to do, eventually he died, and it just made it difficult to troubleshoot technical problems for mom. So now the equipment sits in some corner, unused, because we replaced it all with something your average AT&T technician could troubleshoot.

isostatic 2906 days ago

> With my luck, it would catastrophically fail while out of town, leaving the wife and kids without internet.

Two ISPs, two networks. One called "main", one called "backup".

If "Main" fails, move over to "Backup", either with a cable, or on a different SSID.

jerrysievert 2906 days ago

Where in some cases, the "Backup" is tethering with a smart-phone.

JohnJamesRambo 2906 days ago

Are you advocating buying internet service from two different companies and paying for both every month in case one fails for a brief period of time?

sangnoir 2906 days ago

> Are you advocating buying internet service from two different companies and paying for both every month in case one fails for a brief period of time?

That's not an unreasonable solution, considering most people already pay two ISPs (one fixed, and another for their phone/tablet). When your home wifi goes down, you're going to fall back to your mobile anyway. I'm thinking of getting an extra data SIM and an LTE modem and doing auto-failover.

--edit--

My needs are somewhat unique - my traveling laptop is on its last legs (and will be replaced by a cheap chromebook. Desktops/servers get better bang for the buck compared to laptops. Go figure!), so I tunnel onto a server at home for heavy-lift computing. If the internet fails when I'm not home, I'd be left stranded (and this has happened).

giancarlostoro 2905 days ago

In my case my Surface Book 2 gives me all the firepower I need to not miss my desktop, and it also has a PCIe SSD in it like my desktop. I do agree, sometimes tethering is highly useful, at least in my case on my laptop. I try to keep as many things as offline-capable as possible.

jpk 2906 days ago

That's literally what the author of the article describes.

From a practical point of view I think it's silly to do such a thing for a residential situation, but I can appreciate using it as a learning experience for building systems like this.

isostatic 2906 days ago

Depends how reliable your ISP is and how much it costs if it goes down.

3G is a good enough backup for me, but for the office we go for two routers, two ISPs, and VRRP on the LAN side, load balancing across the WANs, with failover to the other one.

megous 2906 days ago

To be fair, mom probably will not be migrating VMs across three different supermicros and managing a ceph cluster to get a wifi connection.

I would not discount the possibility completely. But I judge it unlikely.

isostatic 2905 days ago

If I wanted a seamless non-SPOF network for my family, I'd put in two Mikrotiks, with the primary on mains and the secondary on UPS: £120 for a pair, to do routing at a decent (1 gig) speed on the main, with built-in 4G on the reserve.

Then I'd put the primary router on the wired line, the other one on a 4G sim which did nothing but heartbeats unless the wired line went down. If the wired line shut down, traffic would reroute via 4G within 10 seconds or so. If the primary router went down, the backup router would take over in a similar time frame. Might put some capping on the 4G router to the netflix/etc boxes to keep bandwidth costs down.

UPS would be about 10W, so £45 for a 4 hour one. Possibly look at renewable energy of some sort to keep the UPS going during an extended outage.

I'd then run VRRP on the LAN side with the primary on the main router (which would have a backup route via the secondary router).

A cloud-based VM would do monitoring/alerting and land outgoing OpenVPN tunnels from both routers to allow secure remote access.

£170, £10 a month plus main ISP, and an hour of config.

However, in reality, having an ISP-provided router and showing them how to tether when there's a problem works fine. OK, they lose their devices if the main circuit goes off, but running those over 4G can be pricey.
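The heartbeat-driven failover described in this comment can be sketched as a tiny state machine. This is a hypothetical Python illustration, not router configuration: the class name, uplink labels, and thresholds are all assumptions. With one heartbeat roughly every 3 seconds, three consecutive misses would give a failover in about the 10 seconds mentioned above.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Chooses the active uplink from periodic heartbeat results.

    All names and thresholds here are hypothetical, for illustration only.
    """
    fail_threshold: int = 3     # consecutive misses before failing over to 4G
    recover_threshold: int = 5  # consecutive successes before failing back
    active: str = "wired"
    _misses: int = 0
    _successes: int = 0

    def record_heartbeat(self, wired_ok: bool) -> str:
        """Feed one heartbeat result for the wired line; return the active uplink."""
        if wired_ok:
            self._misses = 0
            self._successes += 1
            # Fail back only after the wired line has been stable for a while.
            if self.active == "4g" and self._successes >= self.recover_threshold:
                self.active = "wired"
        else:
            self._successes = 0
            self._misses += 1
            # A single lost heartbeat is not enough to switch uplinks.
            if self.active == "wired" and self._misses >= self.fail_threshold:
                self.active = "4g"
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for ok in [True, False, False, False, True, True, True, True, True]:
        print(f"wired_ok={ok} -> active={monitor.record_heartbeat(ok)}")
```

Requiring more successes to fail back than misses to fail over is a deliberate choice in this sketch: it keeps a flapping wired line from bouncing traffic back and forth between uplinks.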

larkeith 2906 days ago

There's a reason Arthur C Clarke's short story Superiority was once required reading at MIT [1].

[1] https://en.wikipedia.org/wiki/Superiority_(short_story)

Aloha 2906 days ago

http://www.mayofamily.com/RLM/txt_Clarke_Superiority.html

Link to the story online.

ddalex 2905 days ago

EU would like to have a word with you.

Aloha 2905 days ago

Me, the person who put it online, or both of us?

cheeze 2905 days ago

According to the Wikipedia article, it was required reading for a specific course, no?

ghotli 2905 days ago

I had never been exposed to this. great read. thanks

nerpderp83 2906 days ago

This was actually a case study from when Clarke was an MBA intern at Google.

Latteland 2906 days ago

I'm pretty sure the sci-fi writer Clarke was never an MBA intern at Google. He'd have been 82 in 2000. Plus he was a scientist, now a biz person.

bscphil 2905 days ago

I'm sure GP was joking.

> because of its own organizational flaws and its willingness to discard old technology without having fully perfected the new.

hessart 2906 days ago

Maybe.

Latteland 2905 days ago

I meant "not a biz person", instead of 'now a biz person', but I can't edit the original posting.

DoubleGlazing 2906 days ago

I used to work for a company whose setup was super simple.

ADSL Modem > Firewall > Router > Web/DB servers

It was basic, but it worked. Our web servers were mission critical, but as a B2B business they, and the ADSL connection, didn't sustain a heavy load. The only issues we had over several years were with the ADSL modem. Everything else just worked.

When we moved office we moved our servers to a co-hosting centre with an upgraded network setup with all sorts of backup and redundancy. Every week something went wrong. Sometimes simple is best.

aidos 2906 days ago

I worked at a place that hosted the servers in-house. They even built a special little air-conditioned room and put a generator on the roof. I never knew all the details but there was dual everything, 2 lines coming in, stuff to switch between them, nothing could possibly go wrong... until the day it did. Turns out someone had plugged all the machines into a single extension cable, and the fuse popped.

walshemj 2906 days ago

Even the big boys do that. In the big storm of '87 in the UK, Telecom Gold (an early online service) was quite proud that the UPS kicked in, only to realize that the modems that linked to the X.25 network were not on the UPS :-)

madmulita 2905 days ago

My anecdata: I used to admin a SWIFT cluster. It was built by the manuals on IBM hardware, that included HACMP with quorum determined by a shared disk.

Nobody understood exactly how the cluster worked, to the point that a correction my boss made to the physical connections made us lose a couple of million dollars in unprocessed transactions.

The funny part is, when the cluster was working fine, a takeover took at least 20 minutes. During that time nothing was "available". The thing is, no matter what, SWIFT Alliance took that time to properly close and open the DB.