by klodolph 2426 days ago
Both A and B are known to be buggy; that is a fact. A is crashing, and B is causing A to crash. A should not crash in the presence of B; that is a bug. B should not cause A to crash; that is also a bug.

You cannot make the decision to roll forward on the basis that you guess that rolling forward to B is stable. You don’t have any evidence for that—just a guess. However, there is solid evidence that rolling backward to A is stable—it has been running that way for a while.

You must make the decision with imperfect knowledge. The correct answer is to roll back.


Except client A never crashed when receiving data from server A.

You can rightly argue that client A was irresponsible in not protecting itself more intelligently against invalid or unexpected inputs (dumping core is the crudest, bluntest protection there is; not the best choice in a distributed environment where transmission errors are a fact of life); but the system as a whole worked.
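
To make the "protect itself more intelligently" point concrete, here is a minimal sketch of a client that rejects input it does not understand instead of dumping core. The JSON wire format, the field names, and `handle_message` are all assumptions for illustration; nothing in the thread says what client A actually parses.

```python
import json

# Hypothetical: the fields client A needs in every message.
EXPECTED_FIELDS = {"id", "payload"}


def handle_message(raw: bytes):
    """Return the parsed message, or None if the input is unusable.

    Malformed or unexpected input is rejected rather than allowed
    to crash the process.
    """
    try:
        msg = json.loads(raw.decode("utf-8"))
    except ValueError:
        # Covers both UnicodeDecodeError and json.JSONDecodeError,
        # which are subclasses of ValueError.
        return None
    if not isinstance(msg, dict) or not EXPECTED_FIELDS.issubset(msg):
        # Valid JSON, but not the shape this client expects.
        return None
    return msg
```

A real client would presumably also log or count the rejection so that a bad rollout gets noticed quickly; the point is only that one bad message need not take the process down.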

Had server B continued to supply client A with data in the format that client A expected to receive, no crash would have occurred and the entire rolling upgrade would have gone without a hitch. But no, the lazy irresponsible corner-cutting assholes couldn’t be arsed doing that; they just start pushing the new data to everyone, and then blame everyone but themselves when that goes sideways.
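
As a rough sketch of what "continuing to supply client A with the format it expects" could look like on the server side: the version labels, the field names, and `encode_for_client` are hypothetical, and it assumes clients announce their version in some way the thread does not describe.

```python
# Hypothetical: the fields that old clients (version "A") understand.
LEGACY_FIELDS = ("id", "payload")


def encode_for_client(record: dict, client_version: str) -> dict:
    """Shape a response for the client version that asked for it.

    Only clients that explicitly identify as "B" get the new format;
    everything else, including unknown versions, gets the legacy one.
    """
    if client_version != "B":
        return {key: record[key] for key in LEGACY_FIELDS}
    return dict(record)
```

Defaulting to the old format for anything that is not known to be new is the conservative choice during a rolling upgrade, since old clients are the ones that cannot cope with surprises.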

“The correct answer is to roll back.”

The correct answer is never to get into a state where rollback becomes necessary. Though having failed to do that, and so ended up in exactly this state, immediate rollback of B to A may well have been the next-best response, followed by system audit to determine what integrity/data loss has occurred and post-mortem of the procedures used, and subsequent corrections so that it doesn’t happen again.

But if you think a bunch of cowboys who were only too happy to shirk their responsibilities during the (private) development phase are suddenly going to own up and accept personal liability when it blows up in the (public) rollout phase, then boy, do I have an eight-figure Enterprise-y grade bridge to sell you.

> The correct answer is never to get into a state where rollback becomes necessary.

We occasionally roll out bad software. I know of no reasonable set of practices which can avoid it.

I honestly don’t understand how you would expect to make this possible without an obscene budget + insanely slow pace of development.

“We occasionally roll out bad software.”

Honestly, as a profession we very rarely roll out anything else. It’s one of the reasons we should be designing systems and procedures that are fault-tolerant from the very start.

In this particular case study, a very small, simple, obvious, cheap step (enabling server B to talk to clients A as well as clients B) was not taken during design and development, nor caught and addressed during testing, resulting in a very large, complicated, costly failure which the culprits then tried to mask instead of owning it like the professionals they’re supposed to be.

That so many other techies should automatically switch into CYA mode when it’s not even their own fuckup that we’re talking about here is a damning indictment of modern developer culture’s attitude toward professional responsibility and personal liability.

It’s excuses all the way down.