by cookiecaper 3567 days ago
Reading through this, it sounds like some basic monitoring would've let them pinpoint the cause quickly instead of wasting time on the database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the web servers are redlining now, better roll that back". A bug or issue in the most recent deploy would logically be one of the first suspects in such a circumstance. Don't know why they wasted 30-60 minutes on a red herring. The correlation would be even more obvious if they took advantage of Datadog's event stream and marked each deployment.
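
For what it's worth, marking a deploy can be a few lines in the deploy script. Here's a rough sketch, assuming Datadog's v1 events endpoint and an API key passed in the `DD-API-KEY` header. The function names, the tag scheme, and the "web"/"abc123" values are made up for illustration, not anything from the postmortem.

```python
import json
import time
import urllib.request

# Assumed endpoint: Datadog's v1 events API on the default (US) site.
DATADOG_EVENTS_URL = "https://api.datadoghq.com/api/v1/events"


def build_deploy_event(service, version, deployed_at=None):
    """Build the JSON body for a deploy-marker event (hypothetical tag scheme)."""
    when = deployed_at if deployed_at is not None else time.time()
    return {
        "title": f"Deployed {service} {version}",
        "text": f"{service} was deployed at version {version}.",
        "tags": [f"service:{service}", f"version:{version}", "event_type:deploy"],
        "alert_type": "info",
        # Unix timestamp in whole seconds.
        "date_happened": int(when),
    }


def build_request(event, api_key):
    """Wrap the event in a POST request; nothing is sent until urlopen() is called."""
    return urllib.request.Request(
        DATADOG_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
```

At the end of a deploy you'd call something like `urllib.request.urlopen(build_request(build_deploy_event("web", "abc123"), api_key))`, and the marker then shows up in the event stream where it can be overlaid on the CPU graphs.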

Additionally, CPU alarms on the web servers should've informed them that the app was inaccessible because the web servers did not have sufficient resources to serve requests. This can be alleviated before pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting portions of the traffic to a static "try again later" page hosted on a CDN or static-only server. The redirect in (b) can be done at the DNS level.
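
To make that concrete, here's a minimal sketch of the kind of check I mean: alarm only when CPU stays above a threshold for several consecutive samples, plus a back-of-the-envelope estimate of how many web servers option (a) would need. The thresholds, the 60% target, and the assumption that load spreads evenly and scales linearly across servers are all mine, not numbers from the postmortem.

```python
import math


def sustained_high_cpu(samples, threshold=90.0, min_consecutive=3):
    """Return True if the latest `min_consecutive` CPU samples all exceed `threshold`.

    Requiring several consecutive samples avoids paging on a single brief spike.
    """
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


def servers_needed(current_servers, avg_cpu_percent, target_cpu_percent=60.0):
    """Estimate the fleet size that brings average CPU down to the target.

    Assumes load is spread evenly and CPU scales linearly with load.
    """
    return math.ceil(current_servers * avg_cpu_percent / target_cpu_percent)


# Example: four web servers averaging 95% CPU, aiming for 60%.
recent = [42.0, 55.0, 93.0, 97.0, 99.0]
if sustained_high_cpu(recent):
    print(servers_needed(4, 95.0))
```

Keep in mind that a box pinned at 100% is hiding how much demand there really is, so treat the estimate as a floor and be ready to fall back to option (b).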

Let this be a lesson to all of us. Have basic dashboards and alarming.

1 comment

We have very comprehensive dashboards. Getting the perfect ones that help in all cases, while not being information overload (the problem here) and being discoverable is a hard, iterative process.
Yes, monitoring requires a lot of tuning until you find a sweet spot, but it doesn't sound like this is something that would've been buried deep in the annals of your monitoring setup. CPU/load data on your web servers should be pretty visible/accessible and one of the first graphs that get pulled up (and your alarms should've pointed out the issue anyway).

I'm not sure what you're using for dashboards but Datadog makes it pretty easy to find this stuff. I'm not a Datadog shill and I actually am not a huge fan of the product, but it's what we use and it's been a big help over our previous Munin installation.

Other process changes that could prevent this are good load testing in a staging environment and getting your company using the real prod code on the real prod infrastructure as its main/default install. A lot of the benefits of "dogfooding" are lost if you're using alpha code on dev-only boxes (as you state that you are in another comment).

As another commenter said, I'm not sure that postmortems like this are valuable unless the problem was particularly complex/interesting. I'm sure that a lot of people at Asana know how to fix this and that it's just a matter of getting management to allow them to do so. I'm sure you owe your customers an explanation of some sort, but I don't know if you need to get into details that say "Yeah, it was just a pretty typical organizational failure, we really should've known better". Everyone has those, but it's best not to publicize them too much.

I'm not going to hold it against Asana because I've worked at a lot of companies and I know how this goes, but when people come here and analyze the cause, as a postmortem invites the readers to do, you seem a little defensive. Perhaps it's best to keep the explanation more brief/vague when it's not a complex failure.