|
|
|
|
|
by marcog1
3563 days ago
|
|
This was the first time we had this class of outage. Many things were in a very bad state, and many of these symptoms were more familiar to us. So we spent time ruling them out before realising webserver CPU was closer to the root cause than the other symptoms. We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and themselves rolled back, which is a rare situation for us. So there was a bit of scrambling to look into the chat logs to determine a safe (non-rolled back) release we could roll back to. Then the high CPU caused our roll back to be really, really slow. Then we still had old processes running the bad release running, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot. |
|