| HN Mirror

Yes, monitoring requires a lot of tuning until you find a sweet spot, but it doesn't sound like this is something that would've been buried deep in the annals of monitor. CPU/load data on your web servers should be pretty visible/accessible and one of the first graphs that get pulled up (and your alarms should've pointed out the issue anyway).

I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.

Other process changes that could prevent this are good load testing in a stage environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).

As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.

I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.