by jmbwell 961 days ago
If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It might also be an effort to get out in front of the story before someone else does the speculating.

In any case, with at least three parties involved, with multiple interconnected systems… if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

Edit to add: I for one am grateful for the information Cloudflare is sharing.


>If Flexential and PGE aren't sharing information or otherwise cooperating as much as Cloudflare might like, then going public with some speculation might be an attempt at applying some pressure to get to the bottom of what happened.

It's been 2 days. I doubt PGE or Flexential even have root caused it yet, and even if they have, good communication takes time.

You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

You also don't publicly share what "Flexential employees shared with us unofficially" (quote from the article) - what a great way to burn trust with people who probably told you stuff in confidence.

>if Cloudflare is going to effectively anticipate this cluster of failure modes in future design decisions, it's reasonable for them to want to know what happened all the way down.

They can do all of that without smearing people on their company blog. In fact, they can do all of that without even knowing what happened to PGE/Flexential, because per their own admission they were already supposed to be anticipating this, but failed at it. Power outages and data center issues are a known thing, and are exactly why HA exists. HA which Cloudflare failed at. This post-mortem should be almost entirely about that failure rather than speculation about a power outage.

> You don't throw someone under the bus and smear their name publicly just because they haven't replied for two days, and you certainly don't start speculating on their behalf. That's bad partnership.

1. When you’re paying them the kind of money I imagine they’re paying and they don’t reply for 2 days, yeah, that’s crazy if true. I’d expect a client of this size could talk to an executive on their personal number.

2. Telling the facts as you know them to be especially regarding very poor communication isn’t a smear.

They aren't telling the facts as they know them. Cloudflare themselves say that the information in the article is "speculation" (the article literally uses that term).

Publicly casting blame based on speculation isn't something you do to someone that you want to have a good working relationship with, no matter how much money you pay them.

That's not true. This is behaviour that would be enough for me to pull the plug on working with this DC, as this is more than unacceptable.

> if you want to have a good working relationship with

What are you disagreeing with OP about?

He is talking about how to behave if you continue the relationship, not whether to continue it.

The post you're replying to is pointing out that multiple days without reporting out a preliminary root cause analysis is so absurdly below the expected level of service here that it would prompt them to reconsider using the service at all.

2 days is outrageous here. I have to imagine whoever thinks that is acceptable is approaching this from the perspective of a company whose downtime doesn't affect profits.

If you actually worked with datacenters you'd understand that what PGE and Flexential did is unacceptable as well.

Agreed. DC sends us notifications any time power status changes. We had a dark building event once, due actually to some similar-sounding thing: power failover caused some arc fault in HV that took out the failover switchgear. We received updates frequently.

UPS failing early sounds like it may be a battery maintenance issue.

We have no idea what their contract is. But two business days without a reply isn’t exactly a long time. Especially if they are conducting their own investigation and reproduction steps.

> But two business days without a reply isn’t exactly a long time

What???? We have 4 hour boots on the ground support with Supermicro and that's a few thousand dollars a year lol.

That doesn't make any sense for a customer as big as CF.

My impression from reading the writeup is that CF did receive support and communication from Flexential during the event (although not as much communication as they would have liked), but hasn't received confirmation from Flexential about certain root cause analysis things that would be included in a post-mortem.

Two days without support communications would be a long time, but my original comment about the two day period is about the post-mortem. It's totally reasonable IMO for a company to take longer than two days to gather enough information to correctly communicate a post-mortem for an issue like this, and IMO it's unreasonable for CF to try to shame Flexential for that.