by 2manyredirects 3316 days ago
One of the unions has been quick to attribute the issue to the outsourcing of some IT responsibility to India, which the right-wing press here has been all too eager to publish, but BA have rebuffed this and said that at this stage they believe the root cause was a power supply issue. Sure, that could be attributed somewhere along the line to an 'alpha male business asshole' (as I read in one of the comments here), but it's probably best to wait and see what the post-mortem really says rather than seek to blame someone, somewhere, be it a businessman, an Indian dev team or anyone else.

I am reminded of a post a while back regarding AWS' issues affecting multiple data centres (I forget the specifics), and how their post-mortem didn't apportion blame to anyone (which it really easily could have), but rather to their own checks and balances, which allowed the issue to arise in the first place. I do hope that when the dust settles we see a measured response rather than a witch hunt.

4 comments

I've found that not only is it good to not assign blame in postmortems, but it's also accurate: The culprit usually is the checks and balances, as mistakes will happen, and the goal should be to have failsafes and detection.

I'm reminded of airplane accidents: Whenever you hear of an airplane accident, it's always some amazingly crazy series of things going exactly wrong to get the plane to crash. We have a tendency to think "wow, what bad luck", but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.

A company's goal should be to increase the number of necessary things that need to all go wrong before there is downtime.
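
To put rough numbers on that, here is a minimal sketch. It assumes the safeguards fail independently of each other, which real systems often violate (a shared power feed is exactly the kind of thing that breaks it), and the 1% figures are made up purely for illustration.

```python
from math import prod


def p_outage(layer_failure_probs):
    """Chance that every safeguard fails at the same time.

    Assumes the failures are independent (an assumption, not a given).
    """
    return prod(layer_failure_probs)


# Hypothetical numbers: each safeguard fails 1% of the time.
one_layer = p_outage([0.01])
three_layers = p_outage([0.01, 0.01, 0.01])

print(f"one safeguard:    {one_layer:.6f}")
print(f"three safeguards: {three_layers:.6f}")
```

Under that assumption each extra layer multiplies the odds of downtime by its own failure rate, which is why adding layers helps so much, and also why a common cause that takes out several layers at once is so damaging.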

One other important point is that the very term "root cause" is extremely harmful in that it presumes a primary failure and already seeds the idea of one bad actor and, by proxy, blame. Systems today are too complex to blame upon one or two things - we operate in a very complicated, "complected" world both in our software and in many organizations.

While there are always technical causes for larger technical failures, I've far too often seen RCA post-mortems turn into witch hunts instead of a solemn contemplation of how things could be done better by everyone. Such an RCA may ignore that a normally careful engineer was overworked by managers, a lack of relevant monitoring and testing due to budget cuts is never cited, and you'll certainly never see "teams X and Y collaborated too much" as a reason for failure in these places. That's because in a typical workplace, the company's values and culture are never considered related to a failure. You can't objectively measure how bad or how good a culture is either. Why make it part of post-mortems when you don't think it's a failure?

I don't recall ever having a manager use the term root cause analysis in the way you are implying. Usually we are looking for the cheapest or most effective process change that will prevent that class of problem from happening again.

> but a better way to think about it is that airplanes are so safe that an accident can't occur unless a whole series of things go very specifically wrong.

As an aside, I met someone who was working on a graph theory problem as their research project, and the application was that you could model the entire process of aircraft control through a state machine using that graph. Effectively they are working on making it mathematically impossible for a crash to occur assuming that a certain process is followed (with safety measures ofc).

> assuming that a certain process is followed

The challenge is to avoid pushing all the risk into that assumption. It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour on the part of its dependencies, environment, users and operators.

If you've seen how aircraft controllers and pilots work, I think that "following the rules" is a very fair assumption to make. But ignoring that, obviously if implemented there would be fail-safes.

> It's easy enough to build a system that never breaks if you're willing to assume perfect behaviour

It isn't though. Seriously, think about how you could safely route several thousand flying hunks of metal (which all have inertia) through fairly small air corridors while maintaining strict flight schedules. Then think about how you need to factor in all of the edge cases caused by emergencies on planes (these are all included in the process for flight controllers). Then think how you could mathematically prove that safety.

Yes, it's easier if you assume that people will follow a certain process (and actual flight systems have so many layers of fail-safes that it's ridiculous) but it's definitely not "easy enough".

If an error makes it into production it is always the process, never an individual, even if the individual involved was malicious. The only thing you can do with errors is fix them, learn from them, and then fix the process too.

Assigning blame does not move the needle at all.

Their system should be built in a way which makes it resistant to a power supply issue. That is the fault of whoever built it, whether onshore or off.

Yeah. I'm sure the "power supply issue" is an overly simplistic explanation; it's highly unlikely a single computer PS could cause such major issues in such a huge organization.

That said, having almost entirely dodged any outsourcing-related issues in the 90s, and worked with generally great offshore teams, seeing my current role impacted by an utterly shortsighted and ignorant attempt to offshore critical operations tasks is quite disheartening. It's rarely the fault of the teams; it's the fault of higher-ups who completely fail to grasp the complexity and consequences of the tasks they're offloading. Everything looks great for a few weeks or months or years until one of the dozens of things that have gone neglected rears its ugly head. If they're lucky, they take the money and run before the crash happens and escape most blame.

I worked on a site where a UPS failure took out a large AS400; the business stopped for three days while they waited for IBM to replace it.

Hey there! "alpha male business asshole" theorist checking back in on this. While not a slam-dunk yet, it's looking like the theory of alpha-male business asshole looking to advance career at expense of company and deflection of blame onto operators of business is gaining some credence!

https://www.thesun.co.uk/news/3671697/ba-travel-chaos-dodgy-...

“We started using the new system in October. Training aside, the whole thing has been a disaster."

“It breaks my heart to see this as I love this company but it is really going down the pan."

“It’s got so bad that some staff members have written to the transport secretary Chris Grayling. All of our concerns have fallen on deaf ears."

“The Chief Executive Alex Cruz, when he was warned about the system told us that it was the staff’s fault not the system."