by ctvo 1438 days ago
> On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. We investigated the status of our services, but were unable to confirm any of the reports. All of our status monitoring and tests reported that everything was operating normally.

> Over the course of the outage, we continued to monitor our service status, and worked with some of the affected users to narrow down the source of the problem.

> On July 14th, at 19:14 UTC we were able to identify that the problem was within our us-west3 region, which we then took offline, directing traffic to other nearby regions instead.

The gap between the first reports (July 13th, around 18:45 UTC) and the confirmation (July 14th, 19:14 UTC), roughly 24.5 hours, is a little concerning.

As an aside:

> ... approximately 18:00 UTC ...

> ... just over 24 hours ...

> ... For a period of around 24 hours, some users in the us-west3 region

> ... less than 30 minutes ...

> ... On July 13th, at around 18:45 UTC we started to receive reports of an outage from a small number of users. ...

"Approximately", "just", "around", "some", "small number of". It goes on and on. I disagree with the stylistic approach of being less specific in posts like these. A "small number of users" is relative. As readers, we have no idea what your typical load may be. Small may be a large number to us. "Just" over 24 hours is 26 hours? 24.5 hours? I implore you to be specific when you have the actual data.

These terms read as weasel words, and they undermine your effort at being fully transparent.


This is a brand new platform and I assume this blog post is one of their first post-mortems. I’m inclined to give them a break, and respect them for establishing a process of post-mortems.

That is all to say, regardless of whether the post itself is “a little concerning,” it would be more concerning if the post didn’t even exist. And if you weren’t one of their affected customers, you likely wouldn’t even know this happened. So they did the right thing by publishing it and opening themselves to your criticism, which is a positive sign for the future of the platform IMO.

In this case it’s disappointing especially because we know they have exact numbers.
Arguably the exact numbers aren't interesting either: "456,982 requests from 5,674 projects by 3,091 user accounts were dropped during the 23:53:42.104 the outage lasted." I think it makes more sense to interpret the numbers and present that instead of the raw figures.
No one needs to know the raw numbers. They may be a little more helpful than no numbers, but they could also be internal data they’re not comfortable sharing, and that’s fine. There are numerous other ways to write this that would have communicated the scale of the impact more rigorously without dealing in absolute, raw numbers.

For example, by using percentages: This outage impacted 1.5% of customer requests vs. a “small” amount.
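To make that concrete, here is a minimal sketch of turning raw counts into a relative figure. The counts and the function names are made up for illustration; nothing here comes from the postmortem or from Deno's actual tooling.

```python
def impact_share(affected: int, total: int) -> float:
    """Return the affected share of traffic as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * affected / total


def impact_sentence(affected_requests: int, total_requests: int) -> str:
    """Phrase the impact relative to overall load instead of as a raw count."""
    share = impact_share(affected_requests, total_requests)
    return f"This outage impacted {share:.1f}% of customer requests."


# Hypothetical counts, chosen only so the example works out to 1.5%.
print(impact_sentence(15_000, 1_000_000))
```

The reader gets the scale of the impact while the absolute request volume stays private.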

Good point. I disagree for my own preference but I don’t doubt that you’re right that many people would rather get the gist of the issue. I’m a numbers person but I know many aren’t (which is perfectly fine — neither way is inherently better).
I think there are some investors out there who might find these numbers more interesting than most numbers they will see today.
We actually don’t, hence the vague language. As mentioned at the bottom, this incident really revealed some holes in the insights we have. I think it’s also important to remember that Deploy is still in public beta, specifically because we are aware that it’s not perfect yet.
It's not clear to the reader what metrics you have and what you do not. Reading the post top to bottom, when we get to this line at the end:

> This incident has made it clear that a few blindspots exist within our monitoring systems.

I assume it's related to the load balancer layer lacking metrics that track when it fails to forward traffic downstream to registered hosts.

> For a period of around 24 hours, some users in the us-west3 region were unable to access dash.deno.com, and Deno Deploy projects, including deno.com and deno.land.

This earlier sentence implies to me that you do have an idea of the scale of the outage. If you instead meant "some" to be "the users who reported they were impacted directly to Deno" and not a metric you have access to, you should just say it.

For example:

We are unable to determine the impact of the outage to customers in the us-west3 region outside of those who reported the issue to us directly.

There's less ambiguity there.

> I think it’s also important to remember that Deploy is still in public beta, specifically because we are aware that it’s not perfect yet.

No system is perfect. There will always be outages. Taking a more rigorous and transparent approach to postmortems isn't related to your system being perfect.

I'm circling back to this days-old thread to give this feedback in the spirit of constructive criticism. It's not meant to embarrass or call anyone out specifically, and I hope you find it helpful.

Yeah, you make a particularly good point about acknowledging what data we don’t have more directly. Thanks for being constructive about it, I do really appreciate it.
Okay, thanks for the correction. I appreciate the transparency about there being holes, and I totally get that it’s a beta as well. I know what you’re doing isn’t easy either — there’s plenty of room for mistakes and learning.