by kctess5
3567 days ago
I find it interesting that they didn't notice the overloading for so long. Also that it took so long to roll back. Given that they reportedly roll out twice a day, it seems like identifying a rollback target would be fairly quick.
|
We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases were bad and had themselves been rolled back, which is a rare situation for us. So there was a bit of scrambling through the chat logs to determine a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.