by klohto 1667 days ago
I have never seen Terraform crash in the years I’ve been using it. Optimizing based on a lottery event seems counter-productive.

EDIT: Alright guys, good points all around. Mainly what I meant is the ratio of Terraform's usefulness to its shortcomings. Not hating on the Cluster API pattern, but imho it's better to stick to a standardized approach.

4 comments

That's a very shortsighted comment. Just because you've never seen it happen doesn't mean that it doesn't. While terraform itself may not crash (maybe not in the manner that I assume you're referring to), you are at the mercy of the implementation of the specific providers, where I've (and many others as well, I'm sure) seen plenty of issues. Even in "mature" ones as aws/gcp.

Besides that, the control-loop pattern that the parent comment describes is a very sound design, one that is employed not only in kubernetes, and it has nothing to do with "Optimizing based on a lottery event".
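
For anyone who hasn't run into the control-loop pattern, here is a minimal sketch in Python. It assumes a hypothetical in-memory `FakeCloud` standing in for a real cloud API, and the names `reconcile` and `run_until_converged` are made up for illustration; this is not how kubernetes or the Cluster API actually implement it. The point is only that each pass re-reads the real state and pushes it toward the desired state, so a failed or interrupted pass is picked up by the next one.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API (assumption, not a real SDK)."""
    instances: set = field(default_factory=set)

    def create(self, name: str) -> None:
        self.instances.add(name)

    def delete(self, name: str) -> None:
        self.instances.discard(name)


def reconcile(desired: set, cloud: FakeCloud) -> bool:
    """Run one pass of the loop; return True if anything had to change."""
    actual = set(cloud.instances)   # always re-read the real state
    to_create = desired - actual    # wanted but missing
    to_delete = actual - desired    # present but not wanted
    for name in sorted(to_create):
        cloud.create(name)
    for name in sorted(to_delete):
        cloud.delete(name)
    return bool(to_create or to_delete)


def run_until_converged(desired: set, cloud: FakeCloud, max_passes: int = 10) -> int:
    """Repeat reconcile() until a pass finds nothing to do; return the pass count."""
    for passes in range(1, max_passes + 1):
        if not reconcile(desired, cloud):
            return passes
    raise RuntimeError("state did not converge")
```

Because nothing is cached between passes, drift (somebody deleting an instance by hand, say) is simply repaired the next time `reconcile` runs.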

> That's a very shortsighted comment.

Is it though? GP claims to have been using terraform for years.

I have been using terraform for some time too, and from what I've seen, well-written providers will error out in a safe and controlled manner instead of making the whole thing crash.

> Even in "mature" ones as aws/gcp.

Can't comment on the gcp provider, but I've worked at companies where the terraform provider is used quite extensively and really can't recall a real crash that created serious problems. And I've also been using the openstack provider, again quite extensively, without much trouble.

So in conclusion I think that it's your argument that's really short-sighted, because it's based on FUD and situations that are theoretically possible but sufficiently remote in the real world that they can be ignored.

Doing a very quick search [1] on the aws provider github repo produces `92` open and `306` closed issues with various bugs in the provider. I agree that most of them might not count as creating serious problems, but I too have been using it quite extensively and can definitely remember cases where I have had to manually deal with corrupted state in various ways.

I'm not sure which part of my comment was "FUD" - I never said that you shouldn't use terraform, I was pointing out that issues exist, always. I think of it as the nature of the work we're doing (I say we, as I imagine you are also in the SWE field based on your comments). Show me any piece of software without bugs, especially as complex as a tf provider, and I'm buying you whatever you want.

[1] https://github.com/hashicorp/terraform-provider-aws/issues?q...

Anecdotally I've had to untangle lots of half applied state. Sometimes a provider has a bug, sometimes somebody on the other side of your company changed your terraform plan without migrating the running state, sometimes there are network issues that prevent your plan from applying correctly, sometimes the cloud provider said it deployed the resources but it actually didn't, etc, etc.

It's extremely painful and delicate to do and it may not happen frequently, but in my experience it happens frequently enough when I'm dealing with a lot of terraform.

I've had Terraform break several times when my AWS STS token expired halfway through a deployment. When that happened I ended up having to manually delete all the created resources, because the Terraform state was corrupted to the point that I couldn't reapply to create the remaining resources but also couldn't delete the existing resources with Terraform.

That said, I agree in general that this wouldn't be the main reason to use the Cluster API over Terraform.
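
One cheap guard against the expired-token case above is a pre-flight check before starting the apply. A rough sketch in Python, assuming you already know when the token expires and roughly how long the apply takes; `safe_to_apply` and its parameters are hypothetical names for illustration, and nothing here calls AWS or Terraform:

```python
from datetime import datetime, timedelta, timezone


def safe_to_apply(expires_at, estimated_apply, margin=timedelta(minutes=5), now=None):
    """Return True if the token should outlive the apply plus a safety margin.

    expires_at: timezone-aware datetime at which the token expires.
    estimated_apply: timedelta, a guess at how long the apply will run.
    margin: extra headroom; the 5-minute default is an arbitrary assumption.
    now: injectable clock for testing; defaults to the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return expires_at - now >= estimated_apply + margin
```

If this returns False, refresh the credentials first instead of starting a run that is likely to die partway through.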

"Optimizing based on a lottery event seems counter-productive"

I think it depends on how often you "win" the lottery. If a lottery event happens once or twice a year and it's fairly hard to resolve, we classify it as a land mine (or a sea mine, for the more maritime oriented people among us). These issues get a lower priority, but we'll try to work through them to avoid emergencies or lower their impact. So, maybe not optimizing, but increasing resilience against lottery events.