by ubernostrum 3605 days ago
LAX still has electricity, but Delta's poor infrastructure planning has left a single point of failure in the Deep South affecting flights everywhere in the world.

The "all over the world" bit is why there's a single point of failure.

A while back there was a story that made the rounds of aviation geeks, about Delta flying an empty 747 to Korea. That was to replace a 747 which had been badly damaged by hail, to the point that it would be unable to operate its scheduled return flight to the US.

Do you want to guess how many people, parts and places were involved in the "simple" task of dealing with this problem?

To begin with, the replacement 747 is in storage in a "boneyard" facility in Arizona, having recently been retired. So it has to be pulled from the storage facility, put through basic airworthiness checks and fueled up, and then a flight crew has to be present to fly it to a Delta hub where it can be readied for a trans-Pacific flight.

The hub in question turned out to be Minneapolis. There, the plane has to undergo more work to get it ready for a long flight, and now multiple flight crews have to be present, since they need to rotate in and out over the duration of the flight (that's how you do long flights). Oh, and Minneapolis isn't normally a 747 base; it only gets them during peak travel seasons and on the occasional charter. So crews probably have to be brought in, stores and maintenance setups need to be brought online, etc.

Then the plane can -- finally -- fly out to replace its damaged counterpart, pick up any stranded passengers and bring them to the US. Which will mean flying into yet another hub, since the flight doesn't go back to Minneapolis.

Meanwhile the damaged plane is still sitting there in Korea, and needs to be repaired on-site to get it into minimum airworthy condition to fly home (empty of passengers). It's going to need parts, maintenance crew, flight crews, etc. just like the replacement plane did.

And the deeper you dig, the more stuff like this you'll find. Running an airline with global, or even just nationwide US, service is not something you can decentralize to avoid problems at one operations center. The amount of coordination of people, parts and planes across widely disparate locations requires centralized operational control rather than devolved regional centers with high autonomy.


Absolutely none of that is an excuse for having an entire airline dependent on a single data center. It's about redundancy. You centralize administration, not the control plane itself. Quorum in database systems, load balancers, and DNS updates solved these problems a long time ago.

At this point I consider a company this large having such a rudimentary single point of failure to be incompetence in the IT department. We wouldn't be so forgiving if Delta needlessly kept all of its pilots in one city overnight, so that a single storm could wipe out every flight.
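
To make the quorum point concrete, here is a minimal sketch of majority-quorum reads and writes across three sites. Everything in it (the Replica class, the site names, the "flight-1" key) is hypothetical and for illustration only; it is not a description of how Delta's or any real airline's systems work, and a real database adds replication logs, conflict handling and much more.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One hypothetical data center holding a copy of the data."""
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, value)

    def write(self, key, version, value):
        if not self.up:
            return False
        self.store[key] = (version, value)
        return True

    def read(self, key):
        return self.store.get(key) if self.up else None


def quorum_write(replicas, key, version, value):
    """Succeed only if a strict majority of replicas acknowledged the write.

    Note: this toy version does not roll back a write that reached
    only a minority of replicas.
    """
    acks = sum(r.write(key, version, value) for r in replicas)
    return acks > len(replicas) // 2


def quorum_read(replicas, key):
    """Return the newest value seen, provided a strict majority responded."""
    live = [r for r in replicas if r.up]
    if len(live) <= len(replicas) // 2:
        raise RuntimeError("no quorum: too many sites are down")
    found = [v for v in (r.read(key) for r in live) if v is not None]
    return max(found, key=lambda pair: pair[0])[1] if found else None


# Three made-up sites; losing any one of them does not stop reads or writes.
sites = [Replica("site-a"), Replica("site-b"), Replica("site-c")]
quorum_write(sites, "flight-1", 1, "delayed")
sites[0].up = False                      # one data center loses power
print(quorum_read(sites, "flight-1"))    # still answers from the other two
```

With three sites, any single outage leaves a majority of two, so the system keeps answering; only when a second site fails does it refuse to serve rather than risk stale data.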

You centralize administration, not the control plane itself.

You're still out of luck when the centralized admin center goes down, though. That's the place that is the source of all the humans performing the coordination and dispatching work. Having a bunch of extra data centers and backup generators around the country will not cause those humans to become accessible.

And building out full redundant continuity of everything, including the humans, is not something that tends to happen outside of major governments.

That's not what happened! It's the computer system that failed. The entire administrative team didn't just up and die.

Also, "centralize administration" just means that you can control everything from a single location. It doesn't preclude being able to control from multiple locations.

Think of AWS: you can control everything across multiple data centers through a centralized interface, from anywhere with an Internet connection, even if entire data centers go down.

A sane system should essentially allow Delta to operate seamlessly from many possible locations, as long as the required human operators are there.