by dkoeji89oe 2825 days ago
For the same reason you write unit or integration tests.

For software development those things make sense. Not so much for Operations(?)

I'm already declaring things with Ansible. Unless I got the syntax wrong ansible will finish a playbook and the server will be as I ordered ansible to make it.

So why add another check?

Belts and braces approach. I use Ansible but I also use testinfra (which is similar to serverspec and I suppose goss as well) to validate things that are not explicitly covered by Ansible, and even in some cases, some things that are.

The tests can be run independently at any later date, helping to ensure that admins haven't messed with important files, upgraded random RPMs or that the servers haven't suffered any other type of configuration drift.
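To make the drift idea concrete, here is a minimal sketch using only the Python standard library rather than testinfra itself. It records SHA-256 digests of important files while the server is in a known-good state and later reports which files are missing or changed. The function names and the baseline format are made up for illustration, and the check only reads files, it never modifies them.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_baseline(paths):
    """Map each path to its current digest (run once, in a known-good state)."""
    return {str(p): sha256_of(p) for p in paths}


def find_drift(baseline):
    """Return the paths that are missing or whose content no longer matches."""
    drifted = []
    for path, expected in baseline.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            drifted.append(path)
    return sorted(drifted)
```

You would store the result of record_baseline() somewhere safe (source control, for example) and call find_drift() on it whenever you want to know whether anything has moved. An empty list means no drift in the files you chose to track.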

But then I build/run emergency services infrastructure with 99.999% availability targets.

Once you are writing tests alongside your Ansible playbooks and committing them to source control, it doesn't really take a lot longer and eventually you have a test suite you can run across your entire environment at a moment's notice.

"The tests can be run independently at any later date, helping to ensure that admins haven't messed with important files"

We all run puppet / ansible multiple times a day on our infra right? Checking for config drift.

I'm used to a banking sector where we have extremely stringent demands. And still I see no added value for something like testinfra/goss.

Seriously: running ansible on your infra checking for config drift.. is that not exactly the same as running goss? Plus: Ansible returns changes to what they should be at the same time?

> We all run puppet / ansible multiple times a day on our infra right? Checking for config drift.

No, not necessarily. I think a lot depends on the rate of change in your environment and the size of the team managing it, and the number, complexity and types of devel/test/production environments being managed.

In an emergency services network like mine changes are the exception rather than the norm, and running ansible/puppet in a loop every 30 mins is a waste of resources. Also, changes that break when pushed to production could result in fire engines or ambulances not being dispatched. Not good.

A devops team I worked for ran up a full vagrant VM on every commit to the puppet repository, then ran puppet inside the VM and ran the full gamut of testinfra tests as well. The whole process took 20-30 minutes at times and if the new code you just pushed broke the tests right at the end it would be 20-30 minutes before you found out. Of course you were supposed to mitigate this by running the tests against your new code in a VM before you pushed them. So that is the other extreme. Personally I found that to be overkill, although it didn't stop broken changes being implemented to production sometimes.

> Ansible returns changes to what they should be at the same time?

Well yes, so does puppet, but that assumes that your puppet manifests/ansible playbooks etc. are all written to be idempotent. The default modules generally work in that manner, but both allow you to write manifests/playbooks that aren't idempotent and that can break things.
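As a toy illustration of the difference (plain Python, with a list of strings standing in for a config file; both function names are made up): the first function is what a careless shell-style "append this line" task amounts to, the second is roughly the "make sure this line is present" behaviour you get from idempotent modules.

```python
def append_line(lines, line):
    """Not idempotent: every run adds another copy of the line."""
    return lines + [line]


def ensure_line(lines, line):
    """Idempotent: adds the line only if it is not already there."""
    return lines if line in lines else lines + [line]
```

Run ensure_line() twice with the same arguments and the second run changes nothing. Run append_line() twice and you have a duplicate entry, which is exactly the kind of thing a repeated config management run can quietly break.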

Infrastructure tests allow you to separate out (or augment) the validation that running puppet/ansible gives you. Because the tests are generally of a read-only nature there is less chance of accidentally changing the state of the servers when you run the tests.

You could run ansible or puppet in 'dry-run' mode I suppose, and examine the output for errors, but testinfra or serverspec give you a much nicer interface IMHO and are more lightweight and execute much faster than a full ansible/puppet run.

That sounds a lot like "if my code is bad, I'll find out when it breaks in production, so why test?". Tests allow you to make changes to your ansible configs, then test that the ansible config works correctly before you apply it in production.
"That sounds a lot like "if my code is bad, I'll find out when it breaks in production, so why test?""

Oh no, definitely not. I mean you test and build your ansible and cluster stuff in Development / Acceptance. We create our monitoring checks there as well. And when all that is done and "green": we go to production.

BTW. Testing your cluster is something you do in dev/acc before you go to prod. And Ansible / monitoring already shows you problems in those pre-production stages.

What added benefit is there for Goss in that scenario?

...you run goss during the process of developing your ansible roles.

Ansible roles are code. Goss allows you to test that code well before you're attempting to create real systems. Real systems include dev/acceptance envs where you're launching your application.

The value isn't in "Did this specific run of ansible work?" but earlier in the process --"After I made changes to my ansible role, does it still meet my requirements?"

aaah, ok, now the veil before my eyes is lifting.

So like in: I change a small thing in ansible, perhaps even in an unrelated role, and then check if all is still as I expect it?

exactly.

this is particularly important if you ever rely on 3rd party modules. a common pattern is to spin up VMs, apply config management, then validate config management as part of your infrastructure's CI process.
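A rough sketch of that pattern as a CI driver script, assuming Vagrant for the VM, Ansible for config management and pytest for the validation step. The inventory file, playbook name and tests directory are placeholders, and the function names are made up. The important parts are stopping at the first failing step and always destroying the VM afterwards.

```python
import subprocess

# Placeholder commands: file names and paths are assumptions, not a real project.
STEPS = [
    ("boot", ["vagrant", "up", "--no-provision"]),
    ("converge", ["ansible-playbook", "-i", "inventory.ini", "site.yml"]),
    ("verify", ["pytest", "tests/"]),
]
TEARDOWN = ["vagrant", "destroy", "-f"]


def run_command(argv):
    """Run one command and return its exit code."""
    return subprocess.run(argv).returncode


def run_pipeline(steps=STEPS, teardown=TEARDOWN, run=run_command):
    """Run the steps in order, stop at the first failure, always tear down.

    Returns the name of the failing step, or None if every step passed.
    """
    failed = None
    try:
        for name, argv in steps:
            if run(argv) != 0:
                failed = name
                break
    finally:
        # Clean up the VM whether or not the steps succeeded.
        run(teardown)
    return failed
```

The `run` parameter is there so the flow can be exercised with a fake runner without booting a real VM. In CI you would call run_pipeline() and fail the build if it returns anything other than None.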