by iAMkenough 960 days ago
Even "we don't know why our data center is failing, but we're sending a team over to physically investigate now" would have been A+ communication in the moment.

Everything was on the status page since the start?

DC related updates:

> Update - Power to Cloudflare’s core North America data center has been partially restored. Cloudflare has failed over some core services to a backup data center, which has partially remediated impact. Cloudflare is currently working to restore the remaining affected services and bring the core North America data center back online. Nov 02, 2023 - 17:08 UTC

> Identified - Cloudflare is assessing a loss of power impacting data centres while simultaneously failing over services.
>
> We will keep providing regular updates until the issue is resolved, thank you for your patience as we work on mitigating the problem. Nov 02, 2023 - 13:40 UTC

As an enterprise customer, I would expect a CSM to reach out and inform us about the impact, go into more detail about any restoration plans, and potentially even give ETAs or a rough prioritization for resolution.

In reality, Cloudflare's support team was essentially completely unavailable on Nov 2, leaving only the status page. And for most of the day, the updates on the status page were very sparse except "we are working on it", and "We are still seeing gradual improvements and working to restore full functionality.".

Clearer status updates were only given starting on Nov 3. Even then, I don't think I heard anything from support or a CSM during that time.

?

1) Were you affected on the data plane? Which product?

As far as I can tell, while the outage was in the core DCs, the impact was minor.

2) Both examples were exactly from 2 November. Not 3 November.

3) What method of support did you try? I thought that their support was impacted (email?).

The status page explicitly said to get in contact with your account manager if you needed config changes on some products.

4) I have never heard of Enterprise customers being contacted by a cloud company during an outage.

Which company does that? Do you have an example?

5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".

Since most customers weren't affected and some others were minorly impacted.

There is not a single cloud company that does that.

Feel free to correct me if I'm wrong...

for us as an enterprise customer for many years:

ssl for saas -> custom hostnames are not working for new domains or changes to current ones. also page rules -> redirects are not working for new rules or changes to current rules. which are game-stoppers for our business.

we contacted via enterprise email support + ccing our managers and assigned engineers.

first they tried to tell us the product was working and sent us some details on how to do this and that, etc. a couple of hours later they understood the issue was bigger than they thought and said "the product is affected by api outage".

then in another email we asked them when this could be solved, but the only answer we got was "please follow status page for the updates".

and after a day, ssl for saas & ssl services appeared on the status page. for a day nobody noticed whether it was working or not except customers.

so as we understand from these emails, even the team internally had no idea what was working and what was not!

>1) Were you affected on the data plane? Which product?

No, but we needed to make urgent changes.

>2) Both examples were exactly from 2 November. Not 3 November.

Both messages contain no clear messages about remediation and co. They also didn't state clearly which products were failed over. I noticed that at this point I could at least login to the dashboard, but most stuff was still severely broken, and I had no idea whether changes with the few semi-functional components were actually applied or not.

Updates to single products with a more clear status were given only at the end of November 2nd (UTC).

(Also, one of the messages says data centres, not just data center. Not sure what happened there.)

>3) What method of support did you try? I thought that their support was impacted (email?).

Emergency line + contacting our CSM. The emergency line was shut down and replaced with voice mail (WTF?), and our CSM did not reply at all (or the message somehow made it to the wrong person, I'll find out next week, I guess).

So in our case, the communication was essentially non-existent, even though I raised a support case (or wanted to).

>4) I have never heard of Enterprise customers being contacted by a cloud company during an outage. Which company does that? Do you have an example?

I remember Datadog reaching out to us for their 2023-03-08 incident. Not sure if it was just our CSM being nice or if someone made a support request on another communication channel, but looking back through our history, that outreach came without us asking, along with the post mortem. Same when stuff happens such as vulnerabilities in one of their packages: they reach out to us proactively and notify us.

To be fair, this is a bit of a wishlist and definitely not necessary for a 30-minute hiccup, but for a 2-day outage... I don't know.

At the bare minimum, I'd expect at least their support team to be replying and not shutting down the communication channels.

>5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".

I don't know... At least at the time I raise an urgent support case about an issue, I expect to be kept up-to-date.

> Since most customers weren't affected and some others were minorly impacted.

What does it mean that they were not affected? Yes, their core service was still functioning (thank god - after all they advertise a 100% (!) SLA on that), but on the same Discord channel you mentioned you can see people failing to renew TLS certificates, people who couldn't make Vercel deployments, and more. So it did affect quite a bunch of downstream customers in their products, and they might also sell SLAs to their customers...

I cannot really comment on whether that just affected us, or if other customers had better support experiences here.

But I expect better in terms of communication here. Doesn't have to be as outreaching as I did in my last message, but stuff like shutting down the emergency line and not giving any comment is not really acceptable for an Enterprise contract.

Just mentioning from "the other side".

We are a service provider in (mostly) Europe.

Our policy (playbook) in case of an issue is to update the status page as quickly as possible, and customers can subscribe via RSS.

There was one issue in the past where we wanted to inform the clients. But it's not easy, as only some were impacted and we decided against it.

5 minutes later (it was out of our hands) it was solved...

Our playbook is to update the status page as soon as possible to inform the clients that something is up and that we are aware.

There shouldn't be too much info on it, since sometimes you just aren't 100% sure about what's exactly going on.

We also decided that we don't want to provide durations on it, since you then create a commitment that's possibly dependent on external factors.

Tbh, I can completely understand the approach from Cloudflare here. With an issue, support is overwhelmed. That's why you use the status page ASAP.

Technical details happen in the post-mortem, when we can be sure whether any data is lost (normally, there is nothing lost though, but it's possible we need to requeue some actions)

=> this is when we can contact our clients and bring them up to date.

Depending on the SLA, it's included or, e.g., paid extra (a lot of the time, an external provider fails and we can fix something from our end, e.g. resending some data)

I've got no knock on the status page. Cloudflare is disappointed in the lack of notification from their data center provider, and Cloudflare customers are disappointed in the lack of notification from their service provider.

Instead of defending what was done and calling that good enough, Cloudflare should use this as an opportunity to commit to reevaluating the strategy for customer outreach during major service failures. If that's what Cloudflare expects from its service providers, that's what Cloudflare should provide to its customers.

?

You want Cloudflare to update every customer about an issue that they probably aren't affected by (except when changing things)?

Who even does that when you've got so many customers?

That's exactly why the status page is there:

https://www.cloudflarestatus.com/

The DC obviously didn't have any means to update their customers.

I don't want that. Cloudflare's customers want that. Cloudflare was embarrassed and needs to listen to the feedback they're receiving.

There is literally not a single cloud company doing that.

Even those that had complete outages.