by felixhuttmann 1666 days ago
If terraform crashes during apply, it leaves behind an inconsistent state by design: The lock is still set, and some resources which were created already are not yet in the statefile. Trying to re-run terraform after a crash during apply will generally lead to an error: Even if the lock is removed, resources may still conflict if they already exist.

In contrast, when a kubernetes controller or operator crashes, it can be expected to continue seamlessly where it left off.

It is easier to write kubernetes controllers that are able to continue seamlessly than to write terraform providers that do so, because of the granularity of the persistence of the state machine. Terraform locks the remote state, then applies all resources in the current root module, then unlocks the remote state again. In contrast, kubernetes operators can granularly update individual objects after each API call that is performed.
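
To make the granularity argument concrete, here is a toy Python sketch. It models only the claim as stated in the comment above, not the real behaviour of Terraform or of any Kubernetes controller. All names (`Crash`, `create`, `apply_batch`, `apply_granular`) are hypothetical, the "cloud" is a plain set, and the "state" is a plain dict.

```python
class Crash(Exception):
    """Simulates the process dying partway through an apply."""


def create(cloud, name):
    """Create a resource in the fake cloud; creating it twice is a conflict."""
    if name in cloud:
        raise ValueError(f"{name} already exists")
    cloud.add(name)


def apply_batch(resources, cloud, state, crash_after=None):
    """Coarse persistence: record results only after every resource is done."""
    created = {}
    for i, name in enumerate(resources):
        if crash_after is not None and i == crash_after:
            raise Crash(name)
        if name in state:
            continue  # already recorded by an earlier successful run
        create(cloud, name)
        created[name] = "created"
    state.update(created)  # never reached if the loop crashed


def apply_granular(resources, cloud, state, crash_after=None):
    """Fine persistence: record each resource right after its API call."""
    for i, name in enumerate(resources):
        if crash_after is not None and i == crash_after:
            raise Crash(name)
        if name in state:
            continue  # already recorded, so a re-run skips it
        create(cloud, name)
        state[name] = "created"
```

In this model, a crash in `apply_batch` leaves resources in `cloud` that `state` does not know about, so a re-run hits the "already exists" conflict. A crash in `apply_granular` loses at most the step in flight, and a re-run picks up at the first unrecorded resource.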


That is an (incredibly) poor explanation of Terraform plugin separation from state management (or perhaps you’re using a weird locking backend that has more issues than correctly implemented ones). If created resources don’t end up in the state file, that is a provider bug (they failed to call `SetId` soon enough).

Source: was a core developer of Terraform and some of the largest providers for many years.

If the explanation is incredibly poor, why is this issue still open: https://github.com/hashicorp/terraform/issues/20718

I have never seen Terraform crash in the years I’ve been using it. Optimizing based on a lottery event seems counter-productive.

EDIT: Alright guys, good points all around. Mainly what I meant is the ratio of Terraform's usefulness to its shortcomings. Not hating on the Cluster API pattern, but imho it's better to stick to a standardized approach.

That's a very shortsighted comment. Just because you've never seen it happen doesn't mean that it doesn't. While terraform itself may not crash (at least not in the manner that I assume you're referring to), you are at the mercy of the implementation of the specific providers, in which I (and many others as well, I'm sure) have seen plenty of issues, even in "mature" ones such as aws/gcp.

Besides that, the control-loop pattern that the parent comment describes is a very sound design. It is employed not only in kubernetes, and it has nothing to do with "Optimizing based on a lottery event".
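
For readers unfamiliar with the control-loop pattern, a minimal Python sketch follows. It is a generic illustration, not code from Kubernetes or any operator framework; `reconcile` and `run` are hypothetical names, and both the desired and the actual state are plain dicts.

```python
def reconcile(desired, actual):
    """One pass: compare desired with actual and take at most one corrective
    action. Returns True when nothing is left to do."""
    for name, spec in desired.items():
        if actual.get(name) != spec:
            actual[name] = spec  # create or update the resource
            return False
    for name in list(actual):
        if name not in desired:
            del actual[name]  # remove a resource that is no longer wanted
            return False
    return True


def run(desired, actual, max_passes=100):
    """Call reconcile until the two states match; return the pass count."""
    for passes in range(1, max_passes + 1):
        if reconcile(desired, actual):
            return passes
    raise RuntimeError("did not converge")
```

Each pass decides what to do purely from the state it observes, with no memory of earlier passes. That is why a loop interrupted at any point can simply be started again and still ends up in the same place.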

> That's a very shortsighted comment.

Is it though? GP claims to have been using terraform for years.

I have been using terraform for some time too, and from what I've seen, well-written providers will error out in a safe and controlled manner instead of making the whole thing crash.

> Even in "mature" ones as aws/gcp.

Can't comment on the gcp provider, but I've worked at companies where the terraform provider is used quite extensively and really can't recall a real crash that created serious problems. I've also been using the openstack provider, again quite extensively, without much trouble.

So in conclusion I think that it's your argument that's really short-sighted, because it's based on FUD and situations that are theoretically possible but sufficiently remote in the real world that they can be ignored.

A very quick search [1] of the aws provider github repo produces `92` open and `306` closed issues with various bugs in the provider. I agree that most of them might not count as creating serious problems, but I have also been using it quite extensively and can definitely remember cases where I have had to manually deal with corrupted state in various ways.

I'm not sure which part of my comment was "FUD". I never said that you shouldn't use terraform; I was pointing out that issues exist, always. I think of it as the nature of the work we're doing (I say we, as I imagine you are also in the SWE field based on your comments). Show me any piece of software without bugs, especially one as complex as a tf provider, and I'm buying you whatever you want.

[1] https://github.com/hashicorp/terraform-provider-aws/issues?q...

Anecdotally I've had to untangle lots of half applied state. Sometimes a provider has a bug, sometimes somebody on the other side of your company changed your terraform plan without migrating the running state, sometimes there are network issues that prevent your plan from applying correctly, sometimes the cloud provider said it deployed the resources but it actually didn't, etc, etc.

It's extremely painful and delicate to do and it may not happen frequently, but in my experience it happens frequently enough when I'm dealing with a lot of terraform.

I've had Terraform break several times when my AWS STS token expired halfway through a deployment. When that happened, I ended up having to manually delete all the created resources, because the Terraform state was corrupted to the point that I couldn't reapply to create the remaining resources but also couldn't delete the existing resources with Terraform.

That said, I agree in general that this wouldn't be the main reason to use the Cluster API over Terraform.

"Optimizing based on a lottery event seems counter-productive"

I think it depends on how often you "win" the lottery. If a lottery event happens once or twice a year and it's fairly hard to resolve, we classify it as a land mine (or a sea mine for the more maritime-oriented people among us). These issues get a lower priority, but we'll try to work through them to avoid emergencies or lower their impact. So, maybe not optimizing, but increasing resilience against lottery events.