
by yowai 961 days ago

>1) Were you affected on the data plane? Which product?

No, but we needed to make urgent changes.

>2) Both examples were exactly from 2 November. Not 3 November.

Neither message says anything clear about remediation and the like. They also didn't clearly state which products had been failed over. I noticed that at this point I could at least log in to the dashboard, but most stuff was still severely broken, and I had no idea whether changes made through the few semi-functional components were actually applied or not.

Updates on individual products with a clearer status were given only at the end of November 2nd (UTC).

(Also, one of the messages says "data centres", not just "data center". Not sure what happened there.)

>3) What method of support did you try? I thought that their support was impacted (email?).

Emergency line + contacting our CSM. The emergency line was shut down and replaced with voice mail (WTF?), and our CSM did not reply at all (or the message somehow made it to the wrong person, I'll find out next week, I guess).

So in our case, the communication was essentially non-existent, even though I raised a support case (or wanted to).

>4) I have never heard of Enterprise customers being contacted by a cloud company during an outage. Which company does that? Do you have an example?

I remember Datadog reaching out to us about their 2023-03-08 incident. Not sure if it was just our CSM being nice or whether someone had filed a support request on another communication channel, but looking back through the history, that came without asking, as did the post mortem. Same when things like vulnerabilities in one of their packages come up: they reach out proactively and notify us.

To be fair, this is a bit of a wishlist and definitely not necessary for a 30-minute hiccup, but for a 2-day outage... I don't know.

At the bare minimum, I'd expect at least their support team to be replying and not shutting down the communication channels.

>5) I would think it's absolutely a no-go to preemptively contact every Enterprise customer with: "hey, the product works, but if you change xyz, atm that doesn't.".

I don't know... At least at the time I raise an urgent support case about an issue, I expect to be kept up-to-date.

> Since most customers weren't affected and some others were minorly impacted.

What does it mean that they were not affected? Yes, their core service was still functioning (thank god - after all they advertise a 100% (!) SLA on that), but on the same Discord channel you mentioned you can see people failing to renew TLS certificates, people unable to make Vercel deployments, and more. So it did affect quite a few downstream customers in their own products, and those customers might also sell SLAs to their own customers...

I cannot really comment on whether that just affected us, or if other customers had better support experiences here.

But I expect better in terms of communication here. It doesn't have to be as far-reaching as what I described in my last message, but stuff like shutting down the emergency line and not giving any comment is not really acceptable for an Enterprise contract.

1 comment

NicoJuicy 961 days ago

Just mentioning from "the other side".

We are a service provider in (mostly) Europe.

Our policy (playbook) in case of an issue is to update the status page as quickly as possible; customers can subscribe via RSS.

There was one issue in the past where we wanted to inform the clients. But it's not easy, as only some were impacted, and we decided against it.

5 minutes later (it was out of our hands) it was solved...

Our playbook is to update the status page as soon as possible to inform the clients that something is up and that we are aware.

There shouldn't be too much info on it, since sometimes you just aren't 100% sure about what's exactly going on.

We also decided that we won't provide durations on it, since you then create a commitment that's possibly dependent on external factors.

Tbh, I can completely understand Cloudflare's approach here. During an issue, support is overwhelmed. That's why you use the status page ASAP.

Technical details come in the post-mortem, when we can be sure whether any data was lost (normally nothing is lost, but it's possible we need to requeue some actions).

=> this is when we can contact our clients and bring them up to date.

Depending on the SLA, it's included or, e.g., paid extra (a lot of the time, an external provider fails and we can fix something from our end, e.g. by resending some data).