Hacker News
by mpdehaan2 4729 days ago
We've got quite a few users managing hundreds and thousands of hosts, so I'm not seeing these kinds of complaints. If we were getting them, I'd feel it, but we aren't :)

One of the things many people want to do is rolling updates too, and Ansible is remarkably good at them, having a language that is really good for talking to load balancers and monitoring and saying, "of these 500 servers, update 50 at a time, and keep my uptime". Folks like AppDynamics are using this to update all of their infrastructure every 15 minutes, continuously, and it's pretty cool stuff.
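To make the "update 50 at a time" idea concrete, here is a minimal Python sketch of the batching logic. It is only an illustration, not how Ansible is implemented: `drain`, `update`, and `restore` are hypothetical callbacks standing in for "pull from the load balancer", "apply the change", and "put back in rotation".

```python
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `hosts`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_update(
    hosts: Sequence[str],
    size: int,
    drain: Callable[[str], None],
    update: Callable[[str], None],
    restore: Callable[[str], None],
) -> int:
    """Update hosts one batch at a time and return the number of batches.

    Only the hosts in the current batch are out of rotation, so the
    rest of the fleet keeps serving traffic.
    """
    count = 0
    for batch in batches(hosts, size):
        for host in batch:
            drain(host)    # hypothetical: remove from load balancer
        for host in batch:
            update(host)   # hypothetical: apply the change
        for host in batch:
            restore(host)  # hypothetical: add back to load balancer
        count += 1
    return count
```

With 500 hosts and a batch size of 50, this walks the fleet in 10 batches, and at most 50 hosts are drained at any moment.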

For those folks that do want to do the 'facebook scale' stuff, ansible-pull is a really good option. One of the features in our upcoming product is a nice callback that enables this while still preserving centralized reporting.

Happy to have the conversation, but I've definitely never heard the CPU time complaint. I think the one thing we see is a lot of users are happy that Ansible is not running when it is not managing anything, rather than having daemons sucking CPU/RAM/etc, and folks are actually getting a little better performance from avoiding the thundering herd agent problems.

3 comments

I just did some consulting on helping another team improve their hadoop cluster performance and the first thing I noticed is that all 40 boxes in the cluster were burning a CPU core with a puppet agent process that was running at 100% CPU for months.
That's one of the nicer things about the no-agent setup: when Ansible is not managing something, there is nothing eating CPU or RAM, and you don't have a problem with your agents falling over (SSHd is rock solid), so you get out of the 'managing the management' problem as well as the 'management is affecting my workload performance' problem.

In particular with virtualization, running a heavy agent on every instance can add up. (Reports of the Ruby virtual machine eating 400MB occasionally occur.)

How does Ansible effectively scale to thousands of hosts using ssh? My experience is that you can only run a few hundred ssh sessions at a time with reasonable performance, and that's on beefy hardware to begin with.
Several different options.

Many folks are actually not doing repeated config management every 30 minutes, though I realize that may be heresy to some Chef/Puppet shops, there's also a school of thought that changes should be intentional. So there is often a difference in workflow.

LOTS of folks are doing rolling updates, because rolling updates are useful.

Many folks are also using ansible in pull mode.

You could also set up multiple 'workers' to push out change, using something like "--limit" to target explicit groups from different machines.
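As a rough sketch of that multi-worker idea, the Python below splits inventory groups across push machines and builds one `ansible-playbook ... --limit <group>` command per group. The worker names, group names, and `site.yml` are hypothetical placeholders, and the round-robin assignment is just one simple way to divide the work.

```python
from typing import Dict, List, Sequence


def assign_groups(
    groups: Sequence[str], workers: Sequence[str]
) -> Dict[str, List[str]]:
    """Round-robin inventory groups across worker machines."""
    if not workers:
        raise ValueError("need at least one worker")
    plan: Dict[str, List[str]] = {worker: [] for worker in workers}
    for index, group in enumerate(groups):
        plan[workers[index % len(workers)]].append(group)
    return plan


def limit_commands(playbook: str, groups: Sequence[str]) -> List[List[str]]:
    """Build one ansible-playbook argv per group, restricted via --limit."""
    return [["ansible-playbook", playbook, "--limit", g] for g in groups]


# Hypothetical inventory groups and push machines.
plan = assign_groups(["web", "db", "cache", "queue"], ["push1", "push2"])
for worker, groups in plan.items():
    for argv in limit_commands("site.yml", groups):
        print(worker, " ".join(argv))
```

Each worker then only opens SSH sessions to its own slice of the inventory.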

If you feed Ansible --forks 50, it's going to talk to 50 hosts at a time and then move on to the next ones (it uses Python's multiprocessing module). If you also set "serial: 50", that's a rolling update of 50 nodes at a time, which ensures uptime on your cluster because you don't take all 1000 nodes down at once.
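The difference between the two knobs can be sketched in Python. This is an illustration under stated assumptions, not Ansible's code: `run_task` is a hypothetical stand-in for "run the play on one host", and a thread pool from `multiprocessing.pool` is used in place of real worker processes to keep the example small.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List, Sequence


def run_task(host: str) -> str:
    """Hypothetical stand-in for running a play against one host."""
    return f"{host}: ok"


def run_with_forks(
    hosts: Sequence[str], forks: int, task: Callable[[str], str] = run_task
) -> List[str]:
    """Run `task` on every host with at most `forks` in flight at once."""
    with ThreadPool(processes=forks) as pool:
        return pool.map(task, hosts)  # results keep the input order


def run_serial(
    hosts: Sequence[str],
    serial: int,
    forks: int,
    task: Callable[[str], str] = run_task,
) -> List[str]:
    """Finish one batch of `serial` hosts before starting the next.

    `forks` only caps parallelism inside a batch; `serial` caps how
    many hosts are being changed at any given time.
    """
    results: List[str] = []
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        results.extend(run_with_forks(batch, forks, task))
    return results
```

So forks is a concurrency limit, while serial is a blast-radius limit: with both at 50, a 1000-node run touches 50 nodes, finishes them, and only then moves on.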

This is really more of a push-versus-pull architecture thing. While it presents some tradeoffs, it's also the exact mechanism that lets the rolling update support, and the ability to base one task on the result of another, work so well.

Ansible also has a 'fireball' mode which uses SSH for the initial connection for key exchange and then encrypts the rest of the traffic. It's a short-lived daemon that doesn't stay running, so when it is done, it just expires.

> Many folks are actually not doing repeated config management every 30 minutes, though I realize that may be heresy to some Chef/Puppet shops, there's also a school of thought that changes should be intentional. So there is often a difference in workflow.

I think this is a false dichotomy. Those who believe runs should be performed frequently often implement this to revert manual changes performed by people operating contrary to site policy.

Not so sure, I've heard that quite a few times. The use case of rack-and-do-not-need-to-reconfigure-until-I-want-to-change-something seems quite common, but I suspect it's often in better-organized ops teams where you don't have dozens of different people logging in and not following the process. There is of course --check mode in Ansible for testing if changes need to be made, as is common in these types of systems. Thankfully, both work, and you can definitely still set things up on cron/jenkins/etc as you like, if you want.
You can actually manage a deployment methodology this way (the "update x hosts at a time" kind of thing) pretty handily using MCollective with Puppet. You can even script your own plugins for it to do basically whatever orchestration flow you want. Pretty cool toolkit and I use it myself.