by debt
2539 days ago
"We identified that our rolled-back election protocol interacted poorly with a recently-introduced configuration setting to trigger the second period of degradation." Damn what a mess. Sounds like y'all are rolling out way to many changes too quickly with little to no time for integration testing. It's a somewhat amateur move to assume you can just arbitrarily rollback without consequence, without testing etc. One solution I don't see mentioned, don't upgrade to minor versions ever. And create a dependency matrix so if you do rollback, you rollback all the other things that depend on the thing you're rolling back as well. |
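For what it's worth, the dependency-matrix idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the component names and the `DEPENDS_ON` map are hypothetical and not taken from the postmortem, and `rollback_set` is a made-up helper, not anything the team is known to use. It walks the dependency graph backwards to find everything that would have to be rolled back together with a given component.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
# None of these names come from the postmortem; they are placeholders.
DEPENDS_ON = {
    "election-protocol": [],
    "config-setting": ["election-protocol"],
    "replication": ["election-protocol"],
    "failover-tooling": ["config-setting"],
    "metrics": [],
}


def rollback_set(target, depends_on):
    """Return the target plus every component that depends on it,
    directly or transitively."""
    # Invert the map so we can look up "who depends on X".
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk over the inverted graph, starting at the target.
    seen = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    print(sorted(rollback_set("election-protocol", DEPENDS_ON)))
```

In this toy map, rolling back `election-protocol` also pulls in `config-setting`, `replication`, and `failover-tooling`, while `metrics` is left alone. A real matrix would also need version constraints, which this sketch ignores.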
Doing a large rollback based on a hunch seems like an overreaction.
It's totally normal for engineers to make these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation are in place to reduce operator errors.
Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group of 2-3 experienced engineers sharing information in real time and coordinating the response could have reacted better.
Of course, I wasn't there so I could be completely off.