
by erenst 1193 days ago

We used to manage 500+ servers with Ansible for almost 10 years. It was a nightmare.

With so many servers, the Ansible script would occasionally fail on some of them (weird bugs, network issues, ...). Since the operations weren't always atomic, we couldn't just re-run the script. It required fixing things manually.
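To illustrate why a partially applied script can't always be re-run, here is a minimal Python sketch. It is not the poster's actual playbook: the function names `append_line` and `ensure_line` and the sample settings are hypothetical, and the "config file" is just a list of strings. It contrasts a step that blindly appends with one that only adds what is missing.

```python
def append_line(lines, line):
    """Non-idempotent step: every run adds another copy of the line."""
    return lines + [line]


def ensure_line(lines, line):
    """Idempotent step: the result is the same however often it runs."""
    return lines if line in lines else lines + [line]


# Hypothetical starting state and setting, for illustration only.
config = ["PermitRootLogin no"]
setting = "PasswordAuthentication no"

# Simulate a run that applied the step, failed later, and was re-run.
naive = append_line(append_line(config, setting), setting)
safe = ensure_line(ensure_line(config, setting), setting)

print(naive)  # the setting appears twice
print(safe)   # the setting appears once
```

A script built only from steps like `ensure_line` can be re-run after a failure; one that mixes in steps like `append_line` leaves a state that depends on where the previous run stopped.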

Thanks to this and emergency patches/fixes on individual servers, we ended up with slightly different setups across the servers. This made debugging and upgrading a nightmare. Can this bug happen on all the servers, or just this one because it has a different minor version of package 'x'?
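The "different minor version of package 'x'" question can be made concrete with a small drift check. This is a sketch under stated assumptions: the `find_drift` helper, the host names, and the version numbers are all made up for illustration, and it assumes the installed package versions have already been collected per host by some other means.

```python
from collections import defaultdict


def find_drift(inventory):
    """Return packages whose installed version differs across hosts.

    inventory maps host name -> {package name -> version string}.
    """
    versions = defaultdict(lambda: defaultdict(list))
    for host, packages in inventory.items():
        for package, version in packages.items():
            versions[package][version].append(host)
    # A package has drifted when more than one version is in use.
    return {
        package: {v: sorted(hosts) for v, hosts in by_version.items()}
        for package, by_version in versions.items()
        if len(by_version) > 1
    }


# Hypothetical inventory, for illustration only.
inventory = {
    "web-01": {"x": "1.4.2", "openssl": "3.0.13"},
    "web-02": {"x": "1.4.2", "openssl": "3.0.13"},
    "web-03": {"x": "1.4.3", "openssl": "3.0.13"},
}

drift = find_drift(inventory)
print(drift)
```

Note that this only compares versions of packages that are present; a package missing entirely from some hosts would need a separate check.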

We switched to NixOS. It had a steep learning curve for us, with lots of doubts about whether this was the right decision. Converting all the servers to NixOS was a huge 2-year task.

Having all the servers running the same configuration that is committed to GitHub, fully reproducible and tested in CI, on top of automatic server updates done with a GitHub Action, was worth all the trouble we had learning NixOS.

This entire blog post could be a NixOS config.

1 comment

pornel 1192 days ago

I realize Ansible is kinda slow and can be flaky, and wouldn't use it for 500 servers. However, for one beginner VPS I think it's fine.

The fact that it's not hermetic and perfectly reproducible is a major problem for a fleet, but for a single user it's a benefit. It offers a graceful migration path from a snowflake server to a managed server, and still works even if you can't manage to do 100% of the config automatically.
