by cetico 2536 days ago
Yes this was very surprising. The system was working fine after the cluster restart. There was no need for an emergency rollback.

Doing a large rollback based on a hunch seems like an overreaction.

It's totally normal for engineers to make these errors. That's fine. The detail that's missing in this PM is what kind of operational culture, procedures and automation are in place to reduce operator errors.

Did the engineer making this decision have access to other team members to review their plan of action? I believe that a group (2-3) of experienced engineers sharing information in real-time and coordinating the response could have reacted better.

Of course, I wasn't there so I could be completely off.

1 comments

"That's fine."

idk the suits have a very different viewpoint; 30 minutes of downtime for a large financial system isn't fine. it can be very costly.

I think the GP means that incidents will occur, and as long as care is (or was) taken to prevent them and learn from them, that's all one can reasonably ask for. The first incident falls under that heading and 'is fine' in a 'life happens' sense.

The following incident comes across as reckless and avoidable, as there should have been procedures to safely test the rollback (and perhaps there were, but a perfect storm allowed it to fail in prod). The lack of details about how the second incident came to be, or how similar incidents will be prevented going forward, places the second incident as 'not fine'.

This information is what the GP comment is asking for.

Compare this PM with Cloudflare's PM, where they detail how they tested rules, what safeguards were in place, how the incident came to be, and how they intend to prevent similar incidents. The impression given here is that they will put up more fire alarms and fire extinguishers but do little fire prevention.