Hacker News
by x86_64Ubuntu 2274 days ago
What's the technical process to ensure that this never happens? Nowadays, having someone "watch" the test and then kill the instances is manual labor, which is a no-no. So how do you make it so that your test fires up the instances and then kills them when the test is done?
5 comments

I think you have to have an upper bound set with AWS that kills stuff when you have reached the amount of money you want to spend. But of course, people would whine about that. "How AWS killed my business on the busiest day of the year," would probably be the article title.
But I have far more sympathy for "I made an AWS mistake and got hit with a 100k bill" than "I told AWS to turn off my EC2 instances at 10k, and then at 10k it turned off my EC2 instances."
There are many ways to solve this problem. One is to model your test infrastructure in CloudFormation. You can then use an SSM Automation Document to manage the lifecycle of your test. Putting all your infrastructure in CloudFormation allows you to clean up all of the test resources in a single DeleteStack API call, and the SSM Document provides: (1) a configurable timeout and cleanup action when the test is done, (2) auditing of actions taken, and (3) repeatability of testing.
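A rough sketch of the CloudFormation half of this in Python, assuming boto3. The stack is created before the test and deleted in a `finally`, so it goes away even when the test throws. `ephemeral_stack` and its parameters are made-up names for illustration, and `cfn` is assumed to be a `boto3.client("cloudformation")`. The SSM Automation timeout is not shown.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_stack(cfn, stack_name, template_body):
    """Create a CloudFormation stack for a test and always delete it afterwards.

    `cfn` is assumed to be a boto3 CloudFormation client,
    e.g. boto3.client("cloudformation"). The helper name is hypothetical.
    """
    cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
    try:
        # Block until the resources exist before handing control to the test.
        cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
        yield stack_name
    finally:
        # Runs whether the test passed, failed, or the create waiter errored.
        cfn.delete_stack(StackName=stack_name)
        cfn.get_waiter("stack_delete_complete").wait(StackName=stack_name)
```

Usage would look like `with ephemeral_stack(cfn, "load-test", template) as name: run_test()`, where `run_test` is whatever drives your test. This does nothing if the test runner process itself dies, which is why a timeout enforced outside the process is still worth having.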
Not sure if this would help in this particular scenario, but unit and integration testing of operations scripts can save a lot of pain, anguish and $$s too.

It's horrifying how many places treat writing tests for services as critical, but then completely fail to write tests for their operational tooling. Including tools responsible for scaling up and down infrastructure, deleting objects etc.
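For what it's worth, a minimal sketch of what a test for operational tooling can look like in Python. The idea is to keep the "which instances should die" decision in a pure function so it can be tested without touching AWS. `select_expired` is a hypothetical helper; the `InstanceId` and `LaunchTime` keys are the field names used in EC2 `describe_instances` output.

```python
from datetime import datetime, timedelta, timezone


def select_expired(instances, now, max_age):
    """Return IDs of instances launched more than `max_age` ago.

    `instances` is a list of dicts with "InstanceId" and "LaunchTime" keys.
    """
    return [i["InstanceId"] for i in instances if now - i["LaunchTime"] > max_age]


def test_select_expired_only_returns_old_instances():
    now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
    instances = [
        {"InstanceId": "i-old", "LaunchTime": now - timedelta(hours=3)},
        {"InstanceId": "i-new", "LaunchTime": now - timedelta(minutes=5)},
    ]
    # Only the instance older than the one-hour limit should be selected.
    assert select_expired(instances, now, timedelta(hours=1)) == ["i-old"]
```

The part that actually calls the terminate API stays thin, and the logic that decides what gets deleted is the part covered by tests.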

But if a test fails, does it now mean you're bankrupt?
Could do? Not sure what your point is here.
You can do timed instances, and/or make the instances have a timed job to shut down after a fixed time (which is what I use to shut down, after an hour, an instance that only gets spun up for occasional CI jobs).
+1. When I had to use AWS for batch workloads, which at the time at least didn't have a TTL attribute on VMs, I made sure that the VM first scheduled a shutdown in like 30 min if the test was supposed to only run in 10 min.
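A sketch of the "schedule the shutdown first" approach in Python, assuming a Linux image and boto3. The helper names are made up for illustration. `launch_params` only builds the keyword arguments you would pass to `ec2.run_instances`; nothing is launched here.

```python
def shutdown_user_data(max_minutes):
    """Build an EC2 user-data script that halts the machine after `max_minutes`."""
    minutes = int(max_minutes)
    if minutes <= 0:
        raise ValueError("max_minutes must be at least 1")
    # `shutdown -h +N` schedules a halt N minutes from now on Linux.
    return f"#!/bin/bash\nshutdown -h +{minutes}\n"


def launch_params(image_id, instance_type, max_minutes):
    """Keyword arguments for boto3's ec2.run_instances (not called here)."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": shutdown_user_data(max_minutes),
        # The default behavior is "stop"; "terminate" makes the OS-level
        # shutdown delete the instance instead of leaving it stopped.
        "InstanceInitiatedShutdownBehavior": "terminate",
    }
```

Because the timer is set by the instance itself at boot, it still fires if the test harness crashes or loses its connection.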
You can use auto scaling groups with a load balancer to terminate instances when not in use and spin them up as required.