Hacker News
by zytek 4123 days ago
Been there, done that. AWS re:Boot in September 2014 showed us how good it was to invest in Ansible roles for all parts of our infrastructure. Still, it was a lot of hassle for the Ops team, especially since it happened during DevOps Days Warsaw ;-) AWS also said '10%' then, but for us it was 81 out of ~300 instances (roughly 27%).

What is sad is that we learn about it from Hacker News and not from AWS, even though we have premium support and our own account manager. :/

Let's see how many of us did our homework after the previous "xen update", and how much "10%" is now ;-)


Similar experience here. I have a few particularly memorable experiences of dealing with the fallout from the September reboots. Although, to be truthful, this is partly because we were moving office and I had people jubilantly packing up around me as I worked to keep things afloat.

Not wanting a repeat of this, we have migrated as many services as we can into autoscaling groups, and automated all resource creation with CloudFormation.

This was inspired by this excellent Netflix blogpost: http://techblog.netflix.com/2014/10/a-state-of-xen-chaos-mon...

I have 19 instances (18 in us-west-2) and none of them are affected. I would guess that lots of people here run in us-east-1, since that's the longest-running region, and I would bet that a lot of that 10% exists there. So it may be 10% total across all regions but a higher percentage if you run in us-east-1. Just a guess though.
I guess not all the events were showing up yet, because now I have 6 of 19 instances going down for a reboot.
44% of my instances in us-west-2, 55% of my instances in us-east-1, and 18% of my instances in eu-west-1 are affected. It seems to be tied pretty tightly to instance types.
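
For anyone wanting to work out their own per-region numbers, here is a rough Python sketch. It assumes you have already collected, for each region, the `InstanceStatuses` list that EC2's DescribeInstanceStatus call returns (for example through boto3's `describe_instance_status`), and it only tallies scheduled reboot events. The function name, the sample data, and the instance IDs are made up for illustration; they are not anyone's real fleet.

```python
# Event codes that EC2 uses for scheduled reboots.
REBOOT_CODES = {"system-reboot", "instance-reboot"}


def affected_share(statuses_by_region):
    """Return {region: (affected, total, percent)} for scheduled reboots.

    statuses_by_region maps a region name to a list of dicts shaped like
    the entries of `InstanceStatuses` in a DescribeInstanceStatus response.
    """
    result = {}
    for region, statuses in statuses_by_region.items():
        total = len(statuses)
        affected = sum(
            1
            for status in statuses
            if any(e.get("Code") in REBOOT_CODES for e in status.get("Events", []))
        )
        percent = round(100 * affected / total, 1) if total else 0.0
        result[region] = (affected, total, percent)
    return result


# Hypothetical sample data, not taken from the thread.
sample = {
    "us-west-2": [
        {"InstanceId": "i-aaaa0001", "Events": [{"Code": "system-reboot"}]},
        {"InstanceId": "i-aaaa0002", "Events": []},
        {"InstanceId": "i-aaaa0003"},
    ],
    "eu-west-1": [
        {"InstanceId": "i-bbbb0001", "Events": [{"Code": "instance-retirement"}]},
        {"InstanceId": "i-bbbb0002", "Events": [{"Code": "instance-reboot"}]},
    ],
}

for region, (affected, total, percent) in sorted(affected_share(sample).items()):
    print(f"{region}: {affected}/{total} ({percent}%)")
```

As noted above, events can take a while to show up, so it is probably worth re-running something like this a few times before trusting the totals.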

Overall, I'm looking at a huge quantity of affected servers. That said, I don't blame AWS. I blame my incompetent architect for designing systems that are incredibly hard to upgrade, and that can't be rebooted safely. Definitely not bitter at that idiocy at all.

Yeah, I'd guess it depends mainly on what type of instances you run. Only one of the 28 instances I'm running right now (all in us-east-1) is going to be rebooted, and it is the only old generation instance I still run (it's a hi1.4xlarge). None of my M3s, C3s or R3s are affected, even though some are still on PV.
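
If you want to check the instance-type theory against your own fleet, a minimal sketch like the one below groups instances by family (the part of the type name before the dot) and counts how many in each family have a reboot scheduled. The `fleet` list is a hypothetical inventory for illustration only, and the helper name is my own.

```python
from collections import Counter


def reboots_by_family(instances):
    """Return {family: (scheduled, total)} from (instance_type, scheduled) pairs."""
    totals, hits = Counter(), Counter()
    for instance_type, scheduled in instances:
        family = instance_type.split(".", 1)[0]  # "hi1.4xlarge" -> "hi1"
        totals[family] += 1
        if scheduled:
            hits[family] += 1
    return {family: (hits[family], totals[family]) for family in totals}


# Hypothetical inventory: (instance type, has a scheduled reboot).
fleet = [
    ("hi1.4xlarge", True),
    ("m3.large", False),
    ("m3.xlarge", False),
    ("c3.large", False),
    ("r3.large", False),
]

for family, (scheduled, total) in sorted(reboots_by_family(fleet).items()):
    print(f"{family}: {scheduled}/{total} scheduled for reboot")
```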