by austinkhale 955 days ago
I love how thorough Cloudflare's post mortems are. Reading the frank, transparent explanations is like a breath of fresh air compared to the obfuscation of nearly every other company's comms strategy.

We were affected but it’s blog posts like these that make me never want to move away. Everyone makes mistakes. Everyone has bad days. It’s how you react afterwards that makes the difference.

I would generally agree with you, but this post mortem was 75% blaming Flexential even though it took Cloudflare almost two days to recover after power was restored. The power outage should have been a single paragraph before pivoting: DC failures happen, it's part of life. Failing to properly account for and recover from it is where the real learnings for Cloudflare are.
It was more of an incident report. The efforts to get back online were mostly around Flexential, so it makes sense to dive into their failings. That said, it is clear there were major lapses of judgement around the control plane design, since it should be able to withstand an earthquake. That they don't have regular disaster recovery testing of the control plane and its dependencies seems crazy. I wonder if it is more that they hoped to eliminate some of those dependencies and replace them with in-house technology, and hedged their bets on the risk.
> Everyone makes mistakes. Everyone has bad days.

The issue is when you start having bad days every other day though. We use and depend on Cloudflare Images heavily; it has now been down more than 67 hours over the last 30 days (22h on October 9th, 42h Nov 2 - Nov 4 and a sprinkle of ~hour-long outages in between). That's 90.6% availability over the last month.

Transparency is a great differentiator between providers that are fighting in the 99.9% availability range, but when you are hanging on for dear life to stay above one nine of availability, it doesn't matter.
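
For anyone who wants to check the arithmetic in the comment above, here is a minimal sketch. It assumes a 30-day window of 24-hour days and treats the quoted outage figures as given; the function names are made up for illustration.

```python
HOURS_PER_DAY = 24


def availability(downtime_hours: float, window_days: int = 30) -> float:
    """Fraction of the window during which the service was up."""
    total_hours = window_days * HOURS_PER_DAY
    return 1.0 - downtime_hours / total_hours


def allowed_downtime_hours(target: float, window_days: int = 30) -> float:
    """Downtime budget (in hours) that still meets the target availability."""
    total_hours = window_days * HOURS_PER_DAY
    return (1.0 - target) * total_hours


# 22h + 42h plus a few roughly hour-long outages: "more than 67 hours".
print(f"{availability(67):.1%}")                  # 90.7% at exactly 67 hours
print(f"{allowed_downtime_hours(0.906):.2f} h")   # downtime implied by 90.6%
print(f"{allowed_downtime_hours(0.999):.2f} h")   # monthly budget at 99.9%
```

At exactly 67 hours this gives about 90.7%, so the quoted 90.6% corresponds to roughly 67.7 hours of downtime, which is consistent with "more than 67 hours". For comparison, a 99.9% target leaves well under one hour of downtime per 30 days.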

They are a younger company than these other providers. Microsoft, Google, and AWS had their own growing pains and disasters. Remember when Microsoft deleted all the data (contacts, photos, etc.) off all their customers' Danger phones by accident and had no backup? Talk about naming their product a self-fulfilling prophecy.
Cloudflare is 14 years old, and Cloudflare Stream, one of the "newer services they didn't have time to make HA", is 6 years old today.
They are 14 years old at this point. AWS has what, four years on them?
AWS was the public release of tooling that Amazon had been building for almost 20 years at that point.

Similar story for GCP.

All three of them had decades of institutional knowledge and procedures in place around running big services by the time Cloudflare was founded.

> AWS was the public release of tooling that Amazon had been building for almost 20 years at that point.

No, even at the outset AWS was an entirely-from-the-ground-up build. The only thing it could even be argued to sit on top of was the extremely crufty VMs and physical load balancers from the original Prod at that point, and those things were not doing anybody any favors.

No they didn't. Amazon was 12 years old when AWS launched. Google was 10 years old when GCP launched.
Cloudflare is fourteen years old.
I agree, but I also think that for security purposes they should leave out extraneous detail. Also, I know they want to hold their suppliers accountable, but I would hold off pointing fingers. It doesn't really improve behavior, and it makes incentives worse.

I really appreciate that they're going to fix the process errors here. But as they suggested, there's a tension between moving fast and being sure. This is typically managed like the weather, buying rain jackets afterwards (not optimal). I'd be curious to see how they can make reliability part of the culture without tying development up in process.

Perhaps they can model the system in software, then use traffic analytics to validate their models. If they can lower the cost of reliability experiments by doing virtual experiments, they might be able to catch more before roll-out.

> I also think that for security purposes they should leave out extraneous detail

Disagree completely, it's the frank detail that makes me trust their story.

Maybe, but I think that their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing at what might have happened. Instead, state the facts you know and move on to your response and lessons learned.
Yeah, that part really rubbed me the wrong way. If this were a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.
I prefer to have their informed speculation here.

Has Flexential provided a similarly detailed, public root cause analysis? If so, maybe we can refer to it. If not, how do you expect us to read it?

It’s only been a couple of business days, and it’s likely that they themselves will need a root cause analysis from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.
I expect them to start reporting out what they know immediately, and update as they learn more. If they're not doing that, and indeed haven't reported anything in days, that is a huge failure.

Imagine if the literal power company failed, and took days to tell people what was going on. You can see why people are reading the postmortem that exists, rather than the one that doesn't.

Cloudflare vowed to be extremely transparent since the start of their existence. I'm very happy with the fact they have managed to keep this a core company value under extreme growth. I hope it continues after they reach a stable market cap. It isn't like Google that vowed not to be evil until they got big enough to be susceptible to antitrust regulation and negative incentives related to ad revenue.
What "security purposes"? Good security isn't based on ignorance of a system; it is based on the system being good. We create a self-fulfilling prophecy when we hide security practices, because then very few will properly implement their security. Openness is necessary for learning.
> know they want to hold their suppliers accountable

They do both. They stated what their problem was and they described their due diligence in picking a DC.

> While the PDX-04’s design was certified Tier III before construction and is expected to provide high availability SLAs

They stated the core issue: innovating fast, which led to not requiring new services to be in the high availability cluster.

Which is also a fix.

From Cloudflare's POV, part of what originally made it worse is the lack of communication by the DC.

Which is an issue, if you want to inform clients.