by cookiecaper
3567 days ago
Reading through this, it sounds like some basic monitoring would have let them pinpoint the cause quickly instead of wasting time on the database servers. All it would take is pulling up the charts in Munin or Datadog or whatever and seeing "Oh, there's a big spike correlated with our deploy and the server is redlining now, better roll that back". A bug in the most recent deploy should logically be one of the first suspects in that situation. I don't know why they spent 30-60 minutes on a red herring. The correlation would be even more obvious if they used Datadog's event stream and marked each deployment.

Additionally, CPU alarms on the web servers should have told them that the app was inaccessible because the web servers did not have enough resources to serve requests. That can be alleviated before pinpointing the cause by a) spinning up more web servers and adding them to the load balancer; or b) redirecting a portion of the traffic to a static "try again later" page hosted on a CDN or static-only server. The redirect can be done at the DNS level.

Let this be a lesson to all of us: have basic dashboards and alarms.
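
To make the "spike correlated with our deploy" check concrete, here is a rough sketch of the logic a dashboard or alert rule would encode. It is plain Python, not the Munin or Datadog API; the function names, the 90% threshold, the three-sample window, and the 15-minute lag are all assumptions picked for illustration, and the sample data is made up.

```python
def first_sustained_breach(samples, threshold=90.0, window=3):
    """Return the timestamp where CPU first stays at or above `threshold`
    for `window` consecutive samples, or None if that never happens.

    `samples` is a time-ordered list of (unix_seconds, cpu_percent) tuples.
    """
    run_start = None
    run_len = 0
    for ts, cpu in samples:
        if cpu >= threshold:
            if run_len == 0:
                run_start = ts  # remember where this run of hot samples began
            run_len += 1
            if run_len >= window:
                return run_start
        else:
            run_len = 0  # a cool sample breaks the run
    return None


def suspect_deploy(deploys, breach_ts, max_lag=900):
    """Return the latest (timestamp, name) deploy that happened at most
    `max_lag` seconds before `breach_ts`, or None if there is none.

    `deploys` is a list of (unix_seconds, name) tuples.
    """
    candidates = [
        (ts, name) for ts, name in deploys if 0 <= breach_ts - ts <= max_lag
    ]
    return max(candidates, default=None)


# Made-up data: CPU idles around 35%, then redlines shortly after a deploy.
samples = [(0, 35.0), (60, 38.0), (120, 36.0),
           (180, 97.0), (240, 99.0), (300, 98.0), (360, 99.0)]
deploys = [(-3600, "release-41"), (150, "release-42")]

breach = first_sustained_breach(samples)
if breach is not None:
    print("CPU redlined at", breach, "suspect:", suspect_deploy(deploys, breach))
```

Requiring several consecutive hot samples keeps a single noisy reading from paging anyone, and the lag limit keeps an hour-old deploy from being blamed for a fresh spike.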