| We had this cause a spectacular outage a few years ago. We were doing exactly this, but with one flaw: we didn't handle the case where the AWS API itself was down. We were constantly polling for how many running instances we had. When the API went down, just as we were ramping up for our peak traffic, the system concluded that none were running, so it kept launching more and more instances. The flood of new instances pummeled the control plane:
thousands of them were all trying to come online and pull down the data they needed to become operational, which then killed our DBs, pipeline, etc. We had to reboot our entire production environment at peak service time. |
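For anyone building something similar, here is a rough sketch of the guard that was missing in the story above: treat "could not fetch the count" as unknown rather than as zero, and cap how many instances a single cycle may launch. Every name here (`CountUnavailable`, `instances_to_launch`, `get_running_count`, `max_launch_per_cycle`) is made up for illustration and is not part of any AWS SDK; the real count lookup would wrap whatever API call you use and raise on failure.

```python
from typing import Callable


class CountUnavailable(Exception):
    """Hypothetical error: the running-instance count could not be fetched."""


def instances_to_launch(
    desired: int,
    get_running_count: Callable[[], int],
    max_launch_per_cycle: int = 10,
) -> int:
    """Return how many instances to launch in this cycle.

    get_running_count is an assumed callable that returns the current
    count, or raises CountUnavailable when the API cannot be reached.
    """
    try:
        running = get_running_count()
    except CountUnavailable:
        # Unknown state is not the same as zero instances: hold steady.
        return 0

    deficit = max(0, desired - running)
    # Cap the launch rate so a bad reading cannot trigger a stampede.
    return min(deficit, max_launch_per_cycle)
```

The cap is a second line of defense: even if a bad reading slips through as a genuine-looking zero, one cycle can only add a bounded number of instances.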
[1] https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API...