by fizlebit 760 days ago
Even if it was operator error, some sort of public COE would help others avoid the pitfall by design, e.g. by restricting terraform's permissions so that it can only affect resources for the system and availability zone (or better still, cell) under deployment. If you're running a deployment to system X, you shouldn't be able to destroy your backup buckets. Essentially, minimize the blast radius of any configuration operation. I guess you'd also want to one-box the terraform change after testing it in preprod, ideally through a pipeline with monitoring. "The power to modify is the power to destroy." Finally, I wonder if there is some way to say to terraform: don't delete more than x resources, start very slowly, and only delete leaf resources, not the top-level resource.
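As far as I know terraform has no built-in "don't delete more than x" switch, but you can approximate one outside the tool. A minimal sketch, assuming the plan has been exported with `terraform show -json` and that the pipeline runs this check before apply. `MAX_DELETES`, `planned_deletions` and `within_limit` are names made up for illustration, not part of terraform.

```python
# Hypothetical threshold (an assumption); pick whatever your pipeline tolerates.
MAX_DELETES = 0


def planned_deletions(plan: dict) -> list:
    """Addresses of resources the plan would delete, replacements included.

    Expects the dict parsed from `terraform show -json <planfile>`, where each
    entry in `resource_changes` carries `change.actions`, e.g. ["delete"] or
    ["delete", "create"] for a replacement.
    """
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


def within_limit(plan: dict, max_deletes: int = MAX_DELETES) -> bool:
    """True when the plan deletes no more than max_deletes resources."""
    return len(planned_deletions(plan)) <= max_deletes
```

In a pipeline you'd run `terraform plan -out=tfplan`, then `terraform show -json tfplan > plan.json`, load that file with `json.load`, and refuse to apply when `within_limit` returns False. It only covers the "no more than x" part; it does nothing for starting slowly or leaf-only deletes, and it's no substitute for locking down permissions.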

At the end of the day, terraform can have a bug. You really want to control blast radius with permissions. Makes me wonder if the GCP VMware integration is a boundary that doesn't expose granular permissions.

If it was operator error with terraform, that should set off alarm bells throughout the industry. Who else is one fat finger away from total annihilation?

2 comments

Somewhere in Australia (with Australian accents):

"Hey terraform just output a wall of text, it wants to know whether or not to proceed."

"That's what it does mate. Let it do its thing, she'll be right."

OK, do people not inspect the output? I do every time, just in case terraform goes off the deep end and decides on violence.
I used to ~own all internal terraform usage at a very large software company. I could talk for hours about all the ways in which the "wall of text" is ineffective at scale. A surprisingly large amount of our technical investment in TF was around improving plan review.
I usually only look deeper if the summary (X created, Y updated, Z deleted) shows some unexpected numbers, especially if the number deleted is not 0. However, if nothing's being deleted, I usually assume it's safe enough.
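For what it's worth, the same summary-first habit can be scripted so the destructive entries get pulled out of the wall of text. A rough sketch, again assuming JSON from `terraform show -json`; `summarize` is a hypothetical helper, and counting a delete-plus-create pair as "replace" is my own bucketing rather than how terraform's summary line words it.

```python
from collections import Counter


def summarize(plan: dict):
    """Return (counts per action kind, addresses of destructive changes)."""
    counts = Counter()
    destructive = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # Skip entries that change nothing.
        if not actions or actions in (["no-op"], ["read"]):
            continue
        if "delete" in actions and "create" in actions:
            kind = "replace"  # delete + create in either order
        else:
            kind = actions[0]
        counts[kind] += 1
        if "delete" in actions:
            destructive.append(rc["address"])
    return dict(counts), destructive
```

The second return value is the short list worth reading line by line. Replacements land in it too, which matters because a replace destroys the old resource even when the change reads like an update.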
Is the VMware integration tied to the GCP marketplace? Maybe Terraform modifying a fundamental VMware resource (like renaming a cluster, etc.) where the entire account is tied to a reseller or integration account could cause it to think it needs to delete and re-create.