by w10-1 962 days ago
I agree, but I also think that for security purposes they should leave out extraneous detail. Also, I know they want to hold their suppliers accountable, but I would hold off pointing fingers. It doesn't really improve behavior, and it makes incentives worse.

I really appreciate that they're going to fix the process errors here. But as they suggested, there's a tension between moving fast and being sure. This is typically managed like the weather, buying rain jackets afterwards (not optimal). I'd be curious to see how they can make reliability part of the culture without tying development up in process.

Perhaps they can model the system in software, then use traffic analytics to validate their models. If they can lower the cost of reliability experiments by doing virtual experiments, they might be able to catch more before roll-out.

3 comments

> I also think that for security purposes they should leave out extraneous detail

Disagree completely, it's the frank detail that makes me trust their story.

Maybe, but I think that their "Informed Speculation" section was probably unnecessary. They may or may not be correct, but give Flexential an opportunity to share what actually happened rather than openly guessing at what might have happened. Instead, state the facts you know and move on to your response and lessons learned.
Yeah, that part really rubbed me the wrong way. If this was a full postmortem published a couple of weeks after the fact and Flexential still wasn't providing details, I could maybe see including it, but this post is the wrong place and time.
I prefer to have their informed speculation here.

Has Flexential provided a similarly detailed, public root cause analysis? If so, maybe we can refer to it. If not, how do you expect us to read it?

It’s only been a couple of business days, and it’s likely that they themselves will need root cause from equipment vendors (and perhaps information from the utility) to fully explain what happened. Perhaps they won’t publish anything, but at least give them an opportunity before trying to do it for them.
I expect them to start reporting out what they know immediately, and update as they learn more. If they're not doing that, and indeed haven't reported anything in days, that is a huge failure.

Imagine if the literal power company failed, and took days to tell people what was going on. You can see why people are reading the postmortem that exists, rather than the one that doesn't.

Cloudflare vowed to be extremely transparent from the start of their existence. I'm very happy that they have managed to keep this a core company value under extreme growth. I hope it continues after they reach a stable market cap. It isn't like Google, which vowed not to be evil until it got big enough to be susceptible to antitrust regulation and negative incentives related to ad revenue.
What "security purposes"? Good security isn't based on ignorance of a system, it is based on the system being good. We create a self-fulfilling prophecy when we hide security practices, because then very few people will implement their security properly. Openness is necessary for learning.
> know they want to hold their suppliers accountable

They do both. They stated what their problem was, and they stated their due diligence in picking a DC.

> While the PDX-04’s design was certified Tier III before construction and is expected to provide high availability SLAs

They named the core issue: innovating fast, which led to not requiring services to run in the high availability cluster.

Which is also a fix.

From Cloudflare's POV, part of what made it worse originally was the lack of communication by the DC.

Which is an issue if you want to inform clients.