by rjbwork 1102 days ago
I'm pretty sure they do, which is a feature, IMO.

We're coming up on 10,000 resources in our main Terraform repository, and while there is definitely some friction, it's overall much better than having to hit the cloud APIs to gather each of those states, which would probably take at least an order of magnitude longer.

We also just recently started setting up a periodic drift detection build to help identify and address drift.
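
For anyone curious, a drift detection build along these lines can be sketched around `terraform plan -detailed-exitcode`, which exits 0 when the plan is empty, 1 on error, and 2 when changes are pending. This is only a rough sketch, not necessarily how the build above works; the function names and the `workdir` argument are made up for illustration. Note that exit code 2 also fires for config changes that haven't been applied yet, not only for drift.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label.

    0 = plan succeeded with no changes, 2 = plan succeeded with changes,
    anything else is treated as an error.
    """
    if code == 0:
        return "in_sync"
    if code == 2:
        return "drift"
    return "error"


def detect_drift(workdir: str) -> str:
    """Run a plan in `workdir` (hypothetical path) and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A scheduled CI job could call `detect_drift` per state directory and alert on anything that isn't `"in_sync"`.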

Could you elaborate on the circumstances that have caused your state sync issues?

2 comments

> We're coming up on 10,000 resources in our main Terraform repository, and while there is definitely some friction, it's overall much better than having to hit the cloud APIs to gather each of those states, which would probably take at least an order of magnitude longer.

I don't think that's necessarily true. Most cloud APIs can return hundreds of records with a single API call, e.g. https://docs.aws.amazon.com/elasticloadbalancing/latest/APIR... has a maximum page size of 400.

If I manage the cloud resources via some custom tools and/or with some ansible-fu, I can decide to batch the API calls when it makes sense.

With Terraform, it is not possible to do so (https://github.com/hashicorp/terraform-plugin-sdk/issues/66, https://github.com/hashicorp/terraform-provider-aws/issues/2...).
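
To make the batching point concrete, here is a toy sketch that counts API calls for a per-resource refresh versus a paginated listing. `FakeCloudAPI` is an in-memory stand-in I made up, not a real SDK, and the page size of 400 just mirrors the figure above.

```python
from typing import Dict, List, Optional, Tuple


class FakeCloudAPI:
    """Hypothetical in-memory API that counts how many calls it receives."""

    def __init__(self, resources: List[dict], page_size: int = 400):
        self.resources = resources
        self.by_id = {r["id"]: r for r in resources}
        self.page_size = page_size
        self.calls = 0

    def describe_one(self, resource_id: str) -> dict:
        # One call per resource, like a per-resource refresh.
        self.calls += 1
        return self.by_id[resource_id]

    def describe_page(self, marker: Optional[int] = None) -> Tuple[List[dict], Optional[int]]:
        # One call per page; returns the page and the next marker (None when done).
        self.calls += 1
        start = marker or 0
        end = start + self.page_size
        next_marker = end if end < len(self.resources) else None
        return self.resources[start:end], next_marker


def refresh_one_by_one(api: FakeCloudAPI, ids: List[str]) -> Dict[str, dict]:
    return {rid: api.describe_one(rid) for rid in ids}


def refresh_batched(api: FakeCloudAPI) -> Dict[str, dict]:
    state: Dict[str, dict] = {}
    marker: Optional[int] = None
    while True:
        page, marker = api.describe_page(marker)
        for resource in page:
            state[resource["id"]] = resource
        if marker is None:
            return state
```

With 1,000 fake resources, the batched version makes 3 calls where the one-by-one version makes 1,000.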

Yeah, we have a 10 MB state file and it already takes 30-45 min to deploy. The main way it happens is when someone cancels a CI/CD job because they've realised something is wrong and don't want to wait that long to try again. That wouldn't be an issue without a state file, but yeah, I can see how there would be a tradeoff.

Pulumi is pretty good about keeping track of in-progress state. If you cancel and restart a new job, it’ll first ask you what to do with the in-progress resources.

IME the plan times explode because the refreshes hit the cloud provider API rate limits.

That does seem like quite a large state file. Would you be able to shed some light on the sorts of resources that you're provisioning? I've found it useful to split dependencies between cloud resources and to add links via data lookups. At any reasonable scale, I've found splitting Terraform state to be crucial.

For example: network resources and Kubernetes clusters are created separately, and networking is a required dependency for Kubernetes clusters. Resources are associated with environments and regions, and modules take in environments and regions as parameters.

In my setup, we have 200+ state files, and the largest is around 80KB.
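
As a rough illustration of the ordering implied above (networking before Kubernetes clusters), the dependencies between split states can be written down as a graph and sorted topologically. The state names below are hypothetical, and this is just a sketch using Python's standard `graphlib`, not a description of the setup above.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def apply_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return state names in an order where each state comes after its dependencies.

    `dependencies` maps a state name to the set of states it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical state names, keyed by component/environment/region.
deps = {
    "network/prod/us-east-1": set(),
    "kubernetes/prod/us-east-1": {"network/prod/us-east-1"},
}
order = apply_order(deps)
```

Here `order` lists the network state before the Kubernetes state that depends on it.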

How do you manage the meta DAG between all of that stuff in the event of, e.g., a complete disaster recovery?