by psanford 2014 days ago
If I were in AWS/GCP's position, I would prefer to send alerts rather than turn off services.

Shutting off services can mean destroying the customer's data with no way for them to recover it. That could be from terminated ephemeral disks, or a terminated database, or cutting off a critical upload stream into their instances.

It's a lot easier to reduce/forgive a bill when a customer makes a mistake than to recover their lost data (or loss to their business).

2 comments

Can’t you just stop spinning up new services and suspend running ones including connections and db accesses? Start with bandwidth?
Stopping DB accesses is a great way to mess up a lot of stuff. Even just stopping connections to the live website would be full of potential issues. Should it stop all access or still allow access to the administrative interface? What if the root cause of the billing overage is a report by finance or someone running a large job on EMR? Should that pull the plug on the website?

What if the new services that are being suspended are writes to queuing systems that are used for order fulfillment or other business processes? Should we drop these orders on the floor?

It's much easier to handle it post facto and write off the expense on the cloud provider side, which doesn't cost them that much anyway. There are some guard rails that prevent people from doing catastrophic things that they can't write off (e.g. taking all of the compute capacity of a region for hours on end, preventing other customers from actually using it), using limits that require manual intervention to be raised.

you can get billing alerts today. it’s not the lack of control mechanisms that prevented this failure. it was the noobism of the user
The overspend was caused by an application bug, not because the user was a "noob".