by protomyth 2458 days ago
I get the feeling that whoever writes the post-mortem is going to have a bit of pressure to assure folks that there is isolation going forward.

That would be a bad sign that there’s something wrong with the culture. I would hope for a postmortem that identified flaws that genuinely needed to be fixed.
Those are not mutually exclusive and actually a good idea. You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs. That's scaling and redundancy 101 - not sure why it would be something wrong.
> Those are not mutually exclusive and actually a good idea.

The goals “assuring folks that there is isolation” and “identifying flaws that need to be fixed” are somewhat contrary to each other.

The post-mortem should identify flaws in systems, processes, and thinking. It should not try to assure people that there is isolation when there is evidence to the contrary.

> You want to fix this specific issue, but also ensure that whatever process took down one DC doesn't affect other DCs.

This was a multi-regional failure, so this specific issue is also an isolation problem, among other things. You will want to ensure that this problem doesn’t happen again, but you shouldn’t assure people that it won’t.

I would think having all zones go down is a flaw that genuinely needed to be fixed.
That’s not the flaw, that’s the outcome. The purpose of a post-mortem is to identify the flaws that caused that outcome, and ways to fix those flaws.
Whoever broke it is going to feel significant pressure to actually isolate things too.