Arguably the exact numbers aren't interesting either. 456 982 requests from 5674 projects by 3091 user accounts were dropped during the 23:53:42.104 hours the outage lasted. I think it makes sense to make an interpretation and present that instead of raw numbers.
No one needs to know the raw numbers. Raw numbers may be a little more helpful than no numbers, but they could also be internal data the team isn’t comfortable sharing, and that’s fine. There are numerous other ways to write this that would have communicated the scale of the impact more rigorously without dealing in absolute, raw numbers.
For example, by using percentages: “This outage impacted 1.5% of customer requests” instead of a “small” amount.
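To make the percentage idea concrete, here is a minimal sketch of how such a figure could be derived from two counters. The function names and the request counts are hypothetical, chosen only for illustration; they are not data from the actual incident.

```python
def impact_percentage(dropped_requests: int, total_requests: int) -> float:
    """Return the share of requests dropped during the outage window, in percent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * dropped_requests / total_requests


def impact_summary(dropped_requests: int, total_requests: int) -> str:
    """Phrase the impact as a relative figure instead of a raw count."""
    pct = impact_percentage(dropped_requests, total_requests)
    return f"This outage impacted {pct:.1f}% of customer requests."


# Hypothetical counts, not real incident data.
print(impact_summary(15_000, 1_000_000))
```

The raw counts stay internal; only the ratio is published, which still tells readers how large the impact was.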
Good point. I disagree for my own preference but I don’t doubt that you’re right that many people would rather get the gist of the issue. I’m a numbers person but I know many aren’t (which is perfectly fine — neither way is inherently better).
We actually don’t, hence the vague language. As mentioned at the bottom, this incident really revealed some holes in the insights we have. I think it’s also important to remember that Deploy is still in public beta, specifically because we are aware that it’s not perfect yet.
It's not clear to the reader which metrics you have and which you do not. Reading the post top to bottom, we get to this line at the end:
> This incident has made it clear that a few blindspots exist within our monitoring systems.
I assume this refers to the load balancer layer lacking metrics that would show it failing to forward traffic downstream to registered hosts.
> For a period of around 24 hours, some users in the us-west3 region were unable to access dash.deno.com, and Deno Deploy projects, including deno.com and deno.land.
This earlier sentence implies to me that you do have an idea of the scale of the outage. If you instead meant "some" to be "the users who reported they were impacted directly to Deno" and not a metric you have access to, you should just say it.
For example:
We are unable to determine the impact of the outage to customers in the us-west3 region outside of those who reported the issue to us directly.
There's less ambiguity there.
> I think it’s also important to remember that Deploy is still in public beta, specifically because we are aware that it’s not perfect yet.
No system is perfect. There will always be outages. Taking a more rigorous and transparent approach to postmortems isn't related to your system being perfect.
I'm circling back to this days-old thread to give this feedback in the spirit of constructive criticism. It's not meant to embarrass or call anyone out specifically, and I hope you find it helpful.
Yeah, you make a particularly good point about acknowledging more directly what data we don’t have. Thanks for being constructive about it; I really do appreciate it.
Okay, thanks for the correction. I appreciate the transparency about there being holes, and I totally get that it’s a beta as well. I know what you’re doing isn’t easy either — there’s plenty of room for mistakes and learning.