by debt 2539 days ago
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation."

Damn what a mess. Sounds like y'all are rolling out way too many changes too quickly with little to no time for integration testing.

It's a somewhat amateur move to assume you can just arbitrarily roll back without consequence, without testing etc.

One solution I don't see mentioned: don't upgrade to minor versions ever. And create a dependency matrix so that if you do roll back, you also roll back everything that depends on the thing you're rolling back.
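
To make the dependency matrix idea concrete, here is a minimal sketch, assuming the matrix is kept as a plain mapping from each component to the components it depends on. The component names are hypothetical and not taken from the postmortem; the function just walks the inverted graph to collect everything that would have to be rolled back together.

```python
from collections import defaultdict, deque

# Hypothetical dependency matrix (names invented for illustration):
# each key depends on the components listed as its value.
DEPENDS_ON = {
    "election-protocol": [],
    "config-service": ["election-protocol"],
    "api-gateway": ["config-service"],
    "billing-worker": ["config-service"],
    "dashboard": [],
}


def rollback_set(depends_on, target):
    """Return target plus everything that depends on it, directly or transitively."""
    # Invert the matrix: component -> set of components that depend on it.
    dependents = defaultdict(set)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(component)

    # Breadth-first walk over the dependents of the rollback target.
    seen = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in sorted(dependents[current]):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print(sorted(rollback_set(DEPENDS_ON, "election-protocol")))
```

In this toy matrix, rolling back `election-protocol` pulls in `config-service` and the two components built on it, while `dashboard` is left alone. A real matrix would also need version constraints, which this sketch ignores.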

2 comments

Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.

Doing a large rollback based on a hunch seems like an overreaction.

It's totally normal for engineers to commit these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation are in place to reduce operator errors.

Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.

Of course, I wasn't there so I could be completely off.

"That's fine."

idk the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. it can be very costly.

I think the GP means that, as far as incidents go, so long as care is (or was) taken to prevent them and learn from them, that's all one can reasonably ask for. The first incident falls under that heading and 'is fine' in a 'life happens' sense.

The following incident comes across as reckless and avoidable, as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it to fail in prod). The lack of details about how the second incident came to be, or how similar incidents will be prevented going forward, places the second incident as 'not fine'.

This information is what the GP comment is asking for.

Compare this PM with Cloudflare's PM, where they detail how they tested rules, what safeguards were in place, how the incident came to be, and how they intend to prevent similar incidents; the impression given here is that they will put up more fire alarms and fire extinguishers but do little fire prevention.

Not sure why this is downvoted, but it all really looks like untested deployments to production servers.

Possibly downvoted because of the name-calling ('what a mess', 'amateur move'), which degrades discussion and is against the site guidelines. It's also sort of distasteful to pile on like that.

https://news.ycombinator.com/newsguidelines.html