by SentinelRosko 955 days ago
I would generally agree with you, but this postmortem was 75% blaming Flexential even though it took them almost two days to recover after power was restored. The power outage should have been a single paragraph and then pivoted - DC failures happen, it's part of life. Failing to properly account for and recover from it is where the real learnings for Cloudflare are.
1 comments

It was more of an incident report. The efforts to get back online were mostly around Flexential, so it makes sense to dive into their failings. That said, it is clear there were major lapses of judgement around the control plane design, since it should be able to withstand an earthquake. That they don't have regular disaster recovery testing of the control plane and its dependencies seems crazy. I wonder if it is more that they hoped to eliminate some of those dependencies and replace them with in-house technology, and so hedged their bets on the risk.