by klodolph 2426 days ago
It absolutely can be second-guessed, because this is a very common scenario and there is a consensus about the default way to deal with it—roll back first, ask questions later. It’s easier to fix bugs while production services are not on fire.

If you are seeing failures and want to roll forward, you should be able to clearly articulate why this is better than rolling back, and what makes this particular scenario different.

Otherwise, I’m going to be on the side of the armchair observers calling for rollbacks.


“roll back first, ask questions later”

Better yet: build code and processes appropriate for long-lived massively distributed systems that will be incrementally upgraded over time. If the system is architected right, it will never get into a state where a rollback recovery becomes necessary. This is why we have Content Negotiation. This is why we have Erlang. This is not a new challenge by decades, and there is a huge body of expert knowledge and tools upon which to draw when implementing such systems, so any such complete catastrophic basic failures now are entirely down to PEBKAC, and remedied by a swift clue-by-four with a pink slip nailed to the end.

There is a very simple principle underpinning distributed communication: servers should never make assumptions about who their clients are and what they need. Talk to the client, find out what format(s) it’s willing/able to accept, and serve it the best match. A client should never need to know, nor care, if it’s talking to a version A server or a version B server: if the client says “I only understand version A data” then it’s the server’s job to serve up data in that exact format, not to pique and whine about how old and out of date the client is, push it version B data instead, and then blame the client for choking on it.
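
A minimal sketch of that principle in Python, under stated assumptions: the record fields, the two wire formats, and the names `encode_v1`, `encode_v2`, `negotiate`, and `serve` are all hypothetical, and nothing here is taken from the system in the article. The client states which format versions it accepts, and the server picks the newest one both sides understand.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical record and wire formats, for illustration only.
Record = Dict[str, str]
Payload = Dict[str, object]


def encode_v1(record: Record) -> Payload:
    """Older format: a single combined 'name' field."""
    return {"version": 1, "name": f"{record['first']} {record['last']}"}


def encode_v2(record: Record) -> Payload:
    """Newer format: separate 'first' and 'last' fields."""
    return {"version": 2, "first": record["first"], "last": record["last"]}


# Every format this server can still produce, keyed by version number.
ENCODERS: Dict[int, Callable[[Record], Payload]] = {1: encode_v1, 2: encode_v2}


def negotiate(accepted: Iterable[int]) -> Optional[int]:
    """Return the newest version both sides support, or None if there is none."""
    common = set(accepted) & set(ENCODERS)
    return max(common) if common else None


def serve(record: Record, accepted: Iterable[int]) -> Payload:
    """Encode the record in the best format the client said it accepts."""
    version = negotiate(accepted)
    if version is None:
        # Refuse explicitly rather than push a format the client cannot parse.
        raise ValueError("no mutually supported format")
    return ENCODERS[version](record)
```

A client that only lists version 1 keeps getting version 1 payloads after the server learns version 2, so a rolling upgrade does not change what old clients see.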

Indolent developers who approach IPC the same as local messaging and then blame everything but themselves when it barfs all over the place are the absolute bane of this industry, and this shit is entirely on them. And shame on the equally inept management culture that continues to let such incompetent amateurs get away with it.

You’re asking too much.

There will be bad rollouts. I know of no set of practices which prevent bad rollouts. You talk about “indolent web developers”, but that’s not productive, and pointing fingers doesn’t make your software work. Your software will, in spite of your best practices, in spite of hiring the best people, in spite of experience, sometimes fall over.

Yes, it will sometimes segfault.

My software shits itself all the time. What matters is that it does so safely. And when it doesn’t, I can tell you why, because I know what corners have been cut and why, and I’m not afraid to accept and acknowledge my responsibilities in such fuckups.

And yeah, I count on the fingers of one hand the number of web developers I’ve dealt with over the last decade who I’d be willing to cross the road to piss on were they on fire, and still have fingers to spare. They’re just the worst of the worst.

There was NO excuse for the failure described in the article. There was NO excuse for the described response to that failure. Yet such base incompetence and gross irresponsibility is not only systemic but entrenched, rationalized, and embraced in this industry. With responses like yours, it’s not hard to tell why. Buncha Children.

> There was NO excuse for the failure described in the article.

In this case, right. In general, stuff happens. There’s a tradeoff between reliability and effort. The correct reliability target is not 100%, because you can’t get 100% anyway, and as you approach 100% reliability the cost increases without bound.
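
One rough way to see the tradeoff is to look at how fast the allowed downtime shrinks. This sketch only does the arithmetic for a few availability targets; the targets are arbitrary examples, `downtime_budget_minutes` is a made-up helper, and it says nothing about what any particular target actually costs to hit.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Each extra "nine" cuts the yearly downtime budget by a factor of ten.
for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} available -> {downtime_budget_minutes(target):.1f} min/year")
```

A target of exactly 100% leaves a budget of zero, which is why it is not a usable goal.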

I’m not sure what the rest of your comment is about besides taking a big shit on web developers and talking about how awful they are.

There is a precious small percentage of developers who are really good at making reliable systems and they have the burden / responsibility of spreading their knowledge. They work with the other actual developers you hire, those beautiful imperfect developers who cut corners, test in production, and don’t write tests.

You make changes to your culture and your practices. You build monitoring and rollout automation. You increase test coverage.
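
As a sketch of what "monitoring and rollout automation" can look like, here is a staged rollout that defaults to rolling back. Everything in it is an assumption for illustration: the `deploy`, `error_rate`, and `rollback` callbacks stand in for whatever your deployment tooling and monitoring provide, and the 1% threshold is arbitrary.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,
) -> bool:
    """Deploy to increasing fractions of the fleet.

    After each stage, read the error rate; at the first reading above the
    threshold, roll back and stop. Returns True if every stage completed,
    False if the rollout was rolled back.
    """
    for fraction in stages:
        deploy(fraction)
        if error_rate() > max_error_rate:
            # Roll back first; debugging happens after production is healthy.
            rollback()
            return False
    return True
```

A real version would wait for metrics to settle after each stage before reading them; that is left out to keep the sketch short.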

If you just call people children you’re going to be there, on the sidelines, watching other people build real products. You don’t teach people by making fun of them.

> I count on the fingers of one hand the number of [X], and still have fingers to spare

So many words to say, imprecisely, one to three (assuming a five-fingered hand).

In this case, it sounds like a rollback (killing B, forcing all B-clients to reconnect to A) is just as damaging as continuing to deploy B (forcing all A-clients to reconnect to B). The tiebreaker in favor of B would be that A is already dead and you'll have to bring some As back. The tiebreaker in favor of A would be other unknowns introduced in B that may be problematic.

Plus, we don't know if the bug resides with A or B. Maybe B is triggering a previously unknown bug in A instead of B being buggy.

Both A and B are known to be buggy; that is a fact. A is crashing. B is causing A to crash. A should not crash in the presence of B; that is a bug. B should not cause A to crash; that is also a bug.

You cannot make the decision to roll forward on the basis that you guess that rolling forward to B is stable. You don’t have any evidence for that—just a guess. However, there is solid evidence that rolling backward to A is stable—it has been running that way for a while.

You must make the decision with imperfect knowledge. The correct answer is to roll back.

Except client A never crashed when receiving data from server A.

You can rightly argue that client A was irresponsible in not protecting itself more intelligently against invalid or unexpected inputs (dumping core is the crudest, bluntest protection there is; not the best choice in a distributed environment where transmission errors are a fact of life); but the system as a whole worked.

Had server B continued to supply client A with data in the format that client A expected to receive, no crash would have occurred and the entire rolling upgrade would have gone without a hitch. But no, the lazy irresponsible corner-cutting assholes couldn’t be arsed doing that; they just start pushing the new data to everyone, and then blame everyone but themselves when that goes sideways.

“The correct answer is to roll back.”

The correct answer is never to get into a state where rollback becomes necessary. Though having failed to do that, and so ended up in exactly this state, immediate rollback of B to A may well have been the next-best response, followed by system audit to determine what integrity/data loss has occurred and post-mortem of the procedures used, and subsequent corrections so that it doesn’t happen again.

But if you think a bunch of cowboys who were only too happy to shirk their responsibilities during the (private) development phase are suddenly going to own up and accept personal liability when it blows up in the (public) rollout phase, then boy, do I have an eight-figure Enterprise-y grade bridge to sell you.

> The correct answer is never to get into a state where rollback becomes necessary.

We occasionally roll out bad software. I know of no reasonable set of practices which can avoid it.

I honestly don’t understand how you would expect to make this possible without an obscene budget + insanely slow pace of development.

“We occasionally roll out bad software.”

Honestly, as a profession we very rarely roll out anything else. It’s one of the reasons we should be designing systems and procedures that are fault-tolerant from the very start.

In this particular case study, a very small, simple, obvious, cheap step (enabling server B to talk to clients A as well as clients B) was not taken during design and development, nor caught and addressed during testing, resulting in a very large, complicated, costly failure which the culprits then tried to mask instead of owning it like the professionals they’re supposed to be.

That so many other techies should automatically switch into CYA mode when it’s not even their own fuckup that we’re talking about here is a damning indictment of modern developer culture’s attitude toward professional responsibility and personal liability.

It’s excuses all the way down.