| HN Mirror


	by marcog1 3610 days ago
	This was the first time we had this class of outage. Many things were in a very bad state, and many of those symptoms were more familiar to us, so we spent time ruling them out before realising webserver CPU was closer to the root cause than the other symptoms. We roll back by reverting to a previous release on the load balancers, which is usually pretty instant. The previous releases had been bad and were themselves rolled back, which is a rare situation for us, so there was a bit of scrambling through the chat logs to find a safe (non-rolled-back) release we could roll back to. Then the high CPU made our rollback really, really slow. Then we still had old processes running the bad release, and killing them on webservers with high CPU took a while to actually work. Then it took a bit of time for load to come down on its own. All of this took place within the 8:08-8:29 window reported in the post. And I'm still simplifying a lot.

4 comments

tomjen3 3610 days ago

What I don't get is why you didn't immediately see the relatively low CPU usage on the database server and the very high usage on the webservers in a Nagios (or similar) dashboard.

mkagenius 3610 days ago

They were distracted by the previous experience of having issues elsewhere.

lrascao 3610 days ago

And apparently there were no alarms in place for this kind of thing.

babo 3610 days ago

Apparently many parts of the system were in alarm.

bdob4xcfH 3610 days ago

It's because they don't have a simple rollup dashboard where you can see that at a glance, and neither do most places. Can you imagine if your car just showed you an event log for door open, oil, turn signals on, etc.? That's what most monitoring systems are like these days.
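
To make the rollup idea concrete, here is a minimal sketch in Python that collapses per-host CPU samples into one status per tier. The tier names, hosts, and thresholds are all made up for illustration; this is not how the site in the post monitors anything.

```python
from statistics import mean


def rollup(samples):
    """Average CPU percentage per tier from (tier, host, cpu_percent) samples."""
    by_tier = {}
    for tier, _host, cpu in samples:
        by_tier.setdefault(tier, []).append(cpu)
    return {tier: mean(values) for tier, values in by_tier.items()}


def status(avg_cpu, warn=70.0, crit=90.0):
    """Map an average CPU percentage to a status (thresholds are assumptions)."""
    if avg_cpu >= crit:
        return "CRIT"
    if avg_cpu >= warn:
        return "WARN"
    return "OK"


# Hypothetical samples: hot webservers, idle database.
samples = [("web", "w1", 95.0), ("web", "w2", 97.0), ("db", "d1", 20.0)]

for tier, avg in rollup(samples).items():
    print(f"{tier}: {avg:.0f}% {status(avg)}")
```

One line per tier is the "at a glance" view: a hot web tier next to an idle database tier is visible without reading an event log.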

jwatte 3610 days ago

Roll backs are in chat logs? I'd assume your scripts would record what they do when they do it, including roll backs.

Also, when only deploying two times a day, it's harder to tell which of the included changes have the problem. That's an argument for more frequent deploys!
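
A rough sketch of what I mean by scripts recording what they do, assuming a simple append-only event log. The `DeployEvent` type, the release names, and the log format are hypothetical, not anything from the post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeployEvent:
    action: str   # "deploy" or "rollback" (hypothetical log format)
    release: str  # for "rollback", the release that was abandoned


def last_safe_release(events: List[DeployEvent], current: str) -> Optional[str]:
    """Return the newest deployed release that is neither the current one
    nor one that was ever rolled back, or None if there is no such release."""
    rolled_back = {e.release for e in events if e.action == "rollback"}
    for event in reversed(events):
        if (
            event.action == "deploy"
            and event.release != current
            and event.release not in rolled_back
        ):
            return event.release
    return None


# Hypothetical history: the two releases before the current one were rolled back.
log = [
    DeployEvent("deploy", "r100"),
    DeployEvent("deploy", "r101"),
    DeployEvent("rollback", "r101"),
    DeployEvent("deploy", "r102"),
    DeployEvent("rollback", "r102"),
    DeployEvent("deploy", "r103"),
]

print(last_safe_release(log, current="r103"))
```

With a log like this, picking the rollback target is a lookup rather than a search through chat history.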

abhishekash 3610 days ago

Seems like the logging was ambitious enough to trip the servers! I will be careful with my logging next time :)

ycombinatorMan 3610 days ago

Out of curiosity, why are you deploying to all your web servers simultaneously? Could you not do a partial roll-out to make sure something like this doesn't happen?

mkagenius 3610 days ago

I doubt a partial roll-out would have helped in this particular case, since the problem only shows up under high load and they roll out new code twice a day.

marcog1 3610 days ago

Correct. We don't roll out during peak load either.

tonfa 3610 days ago

Have you considered at least starting your release canary during peak load?

marcog1 3610 days ago

We have talked about it. It is unlikely to have helped with an event like this, and I don't recall an event where it would have. It also has the downside of extending our deployment cycle by a lot. Notably, we do run a canary internally, and that had no issues, which actually threw us off for a while: the app was partially down for users but working for us, and that hasn't happened to us in a while.
