by cetico
2536 days ago
Yes, this was very surprising. The system was working fine after the cluster restart, so there was no need for an emergency rollback. Doing a large rollback based on a hunch seems like an overreaction. It's totally normal for engineers to make these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures, and automation are in place to reduce operator errors. Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real time and coordinating the response could have reacted better. Of course, I wasn't there, so I could be completely off.
idk, the suits have a very different viewpoint: 30 minutes of downtime for a large financial system isn't fine. It can be very costly.