by vitalus
2463 days ago
Definitely can sympathize with you on this, having spent plenty of time myself fighting clusters that ended up in a broken state and trying to get them going again. I think this pain is sometimes more severe with the automated provisioning tools out there and the trend towards immutable infrastructure: folks tend not to have the know-how to dig in and mutate that state if need be. It's really important, though, for teams to have a story about either investing in the knowledge needed to make these fixes or having the tooling in place to quickly rebuild everything from scratch and cut over to a new, working production cluster in a minimal amount of time.

As I build my knowledge I am also building Ansible playbooks and task files. After each iteration I shut down my cluster, do an automated rebuild, and test. Then I delete the original cluster and start my next iteration.
I have an admin box with everything I need to persist between builds (Ansible, keys, configuration files, etc.) and can deploy whatever size and quantity of worker VMs I need.
It has been a good process so far. I haven't yet put things in an unrecoverable state, but if that happens I can rebuild the cluster to my most recent save and try again.
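For anyone who wants to script that loop from the admin box, here is a rough sketch of how it could look. This is not my actual tooling: the playbook names (`stop_cluster.yml`, `build_cluster.yml`, `test_cluster.yml`, `destroy_cluster.yml`), the `cluster_name` variable and the `inventory.ini` path are all made up for illustration. The only real pieces are `ansible-playbook` with its `-i` and `-e` options and Python's `subprocess.run`.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes a command (list of arguments) and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(list(cmd), check=False).returncode


def iterate_cluster(new_name: str, old_name: str,
                    runner: Runner = run_command,
                    inventory: str = "inventory.ini") -> str:
    """Shut down the old cluster, rebuild, test, then delete the old one.

    The old cluster is deleted only after the new build passes its tests,
    so a failed iteration never destroys the last known-good state.
    """
    def playbook(name: str, cluster: str) -> List[str]:
        # Playbook names and the cluster_name variable are hypothetical.
        return ["ansible-playbook", "-i", inventory, name,
                "-e", f"cluster_name={cluster}"]

    steps = [
        ("stop_failed", playbook("stop_cluster.yml", old_name)),
        ("build_failed", playbook("build_cluster.yml", new_name)),
        ("test_failed", playbook("test_cluster.yml", new_name)),
        ("cleanup_failed", playbook("destroy_cluster.yml", old_name)),
    ]
    for failure_label, cmd in steps:
        if runner(cmd) != 0:
            # Stop at the first failing step; later steps never run.
            return failure_label
    return "ok"
```

Calling `iterate_cluster("iter-08", "iter-07")` would run the four playbooks in order. The `runner` argument is only there so the ordering logic can be exercised with a fake runner, without touching real VMs.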
I don't see it taking a lot of resources to have a proving ground. I would definitely not feel comfortable going to production without the ability to reproduce the production clusters' exact state.
I anticipate using exactly what you describe as a rollback mechanism. At all times I want to be able to automate the deployment of clusters to an exact known state.
I think building a cluster, walking away from it for a year, and then coming back to it for a break-fix, update, or new deployment is a huge gamble.