|
There was a good talk at HashiConf a few years back, but I can't find a link now. I can't say that this is the "right" way to do it, but it scales OK for our org (100+ engineers, 200+ services per environment, many third-party and in-house providers). A single team owns the codebase, though all backend engineers are expected to write and maintain their own infra code. Some highlights:
* Monorepo with ~700 Terraform modules, >1000 Terraform workspaces
* CI/CD tooling that uses Merkle tree/fingerprinting to work out the graph of workspaces that need to be executed for a given change, and builds dynamic plan/apply pipelines for each PR
* Strict requirements about master being up to date, and serialised merges managed by a bot, with continuous deployment (i.e. apply) on merge
* Templated code/PR generation for common tasks
* Tooling for state moves, lock management etc
* Daily full application of all workspaces with alerting on state drift (e.g. externally updated resources)
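To give a rough idea of the fingerprinting bullet, here is a minimal Python sketch, not our actual tooling. It assumes each module can be reduced to a blob of bytes (in practice, a hash over the files in its directory) and that you already have a map of which module depends on which. The names `fingerprints`, `changed`, `contents` and `deps` are made up for the example.

```python
import hashlib


def fingerprints(contents, deps):
    """Return {module: hex digest}. Each digest covers the module's own
    content plus the digests of its dependencies (Merkle-style), so a change
    anywhere upstream changes every downstream fingerprint."""
    cache = {}

    def visit(name, stack=()):
        if name in cache:
            return cache[name]
        if name in stack:
            raise ValueError(f"dependency cycle through {name!r}")
        h = hashlib.sha256(contents[name])
        # Sort so the digest does not depend on declaration order.
        for dep in sorted(deps.get(name, [])):
            h.update(visit(dep, stack + (name,)).encode())
        cache[name] = h.hexdigest()
        return cache[name]

    for name in contents:
        visit(name)
    return cache


def changed(old, new):
    """Modules whose fingerprint differs from the previously recorded one."""
    return sorted(n for n, digest in new.items() if old.get(n) != digest)


# Hypothetical example: "app" depends on "network", "dns" stands alone.
deps = {"app": ["network"], "dns": []}
before = fingerprints({"network": b"v1", "app": b"v1", "dns": b"v1"}, deps)
after = fingerprints({"network": b"v2", "app": b"v1", "dns": b"v1"}, deps)
print(changed(before, after))  # ['app', 'network']
```

Only the workspaces returned by `changed` would need a plan/apply stage in the pipeline for that PR; everything else can be skipped.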
As an org, we average about 20 infrastructure changes per day through this system.

A few tips:
* Find the right level of abstraction for breaking larger workspaces down into smaller ones. This should be determined by things like rate of change, security requirements, team ownership, and in some cases whether you have a flaky provider that you want to isolate. Size of state should also be a consideration: if a workspace takes 2 minutes to plan, it's too big
* If you start to use lots of remote states, wrap them all in a module with a sane interface to make consumption easier. You can also embed rules in this module like "workspace X cannot consume from state Y" (e.g. because of circular dependencies or security considerations)
* Never embed a provider within a module (I think this is enforced in newer TF versions)
* Terraform is a hammer that can tackle most nails, but for several problems there are more appropriate tools
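For the remote state rules, the kind of check I mean could look something like the Python sketch below. It is an illustration only: it assumes every workspace publishes one state under its own name, and `check_consumers`, `consumes` and `denied` are hypothetical names rather than anything Terraform provides. It also only catches direct two-way cycles, not longer chains.

```python
def check_consumers(consumes, denied):
    """consumes: {workspace: set of state names it reads}
    denied:   set of (workspace, state) pairs forbidden by policy
    Returns a list of human-readable violations (empty if all is well)."""
    problems = []
    for ws, states in sorted(consumes.items()):
        for st in sorted(states):
            if (ws, st) in denied:
                problems.append(f"{ws} may not consume {st}: denied by policy")
            # Report each mutual pair once (ws < st), not from both sides.
            if ws < st and ws in consumes.get(st, ()):
                problems.append(f"{ws} and {st} consume each other's state")
    return problems


# Hypothetical workspaces and rules.
consumes = {
    "app": {"network", "secrets"},
    "network": {"app"},
    "secrets": set(),
}
denied = {("app", "secrets")}

for problem in check_consumers(consumes, denied):
    print(problem)
```

Running something like this in CI means a bad dependency shows up as a failed check on the PR instead of as a surprise at apply time.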
|
In that case, the best choice is lots of remote states / data sources, independent modules in independent repos that reuse other modules, and strict adherence to internal conventions, including branching/naming/versioning standards running the gamut from your VCS, to the module, to the code, to the data structures, to the "terraformcontrol" repos, etc. Basically, standardize every single possible thing. If anyone ever needs something to work differently, update the standard before the individual module.
How and when to separate remote states is still a bit of black magic. In general, you can make a new state for each complete unit of deployment. Assuming a deployment has stages, you can separate Terraform state into those different stages, so that you can step through them, applying as you go and stopping if you detect a problem. The biggest mishap is when you're trying to apply 100 changes and your apply fails halfway, and you have to stop the world to manually fix it, or revert, which may not even work. It's much easier to manage a change that affects a few resources than one that affects lots of them.
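Stepping through stages and stopping on the first problem can be sketched in a few lines of Python. This is an assumption-heavy illustration: it assumes each stage is its own initialised root module directory with its own state, that `terraform` is on the PATH, and the function names and directory names are invented for the example.

```python
import subprocess


def terraform_apply(stage_dir):
    """Apply one stage non-interactively; True if terraform exited with 0."""
    result = subprocess.run(
        ["terraform", f"-chdir={stage_dir}", "apply", "-auto-approve", "-input=false"]
    )
    return result.returncode == 0


def apply_in_stages(stages, apply=terraform_apply):
    """Apply stages in order and stop at the first failure.
    Returns (applied, failed); failed is None when every stage succeeded."""
    applied = []
    for stage in stages:
        if not apply(stage):
            return applied, stage
        applied.append(stage)
    return applied, None


if __name__ == "__main__":
    # Hypothetical stage directories, ordered by dependency.
    done, failed = apply_in_stages(["network", "database", "app"])
    if failed:
        print(f"stopped at {failed}; already applied: {done}")
```

Because each stage has a small state of its own, a failure leaves you with one small thing to fix or revert rather than a half-applied monolith.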