Hacker News
by solatic 9 days ago
Use Ansible to update /etc/hosts on hundreds of thousands of hosts every time a host is added or removed?

Thanks for the laugh...

Isn’t this just DNS with extra steps, anyway? Now Ansible is basically the DNS server.
It replaces DNS's pull-based architecture (contact a DNS server to get the IP address) with a push-based one (push the IP addresses to each /etc/hosts file).
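Here is a hedged toy model of the two architectures described above. It assumes an "inventory" dict that maps hostnames to IPs. `render_hosts`, `push_update` and `pull_lookup` are hypothetical names, not Ansible's or any resolver's real API. The push model rewrites every client's hosts file on each change. The pull model touches only the central table, and only when a name is actually looked up.

```python
from typing import Dict, Optional

# Hypothetical toy model; not Ansible's or any real resolver's API.
# An "inventory" maps hostnames to IP addresses.


def render_hosts(inventory: Dict[str, str]) -> str:
    """Render /etc/hosts-style lines ("IP<TAB>hostname"), one per known host."""
    return "".join(f"{ip}\t{name}\n" for name, ip in sorted(inventory.items()))


def push_update(inventory: Dict[str, str], hosts_files: Dict[str, str]) -> int:
    """Push model: on every change, overwrite every client's copy of the file.

    Returns the number of file writes a single change costs (one per client).
    """
    content = render_hosts(inventory)
    for client in hosts_files:
        hosts_files[client] = content
    return len(hosts_files)


def pull_lookup(inventory: Dict[str, str], name: str) -> Optional[str]:
    """Pull model: a client asks the central resolver only when it needs a name."""
    return inventory.get(name)


inventory = {"web1": "10.0.0.1", "db1": "10.0.0.2"}
clients = {f"client{i}": "" for i in range(3)}
push_update(inventory, clients)  # -> 3 (every client rewritten)
pull_lookup(inventory, "db1")    # -> "10.0.0.2" (one lookup, nothing rewritten)
```

Push cost grows with the number of clients on every change. Pull cost is paid per lookup and is usually absorbed by caching.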

Suggesting that a push-based, Ansible-based architecture will scale to hundreds of thousands of targets, with such pushes happening hundreds if not thousands of times a day, is a junior-level idea at best, dark comedy if I'm being charitable, and professional malpractice at worst.
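The scale argument can be made concrete with back-of-envelope arithmetic. This is a minimal sketch with loudly labeled assumptions. It uses 100,000 hosts and 1,000 changes a day, round numbers picked for easy arithmetic. Every change is pushed to every host. Each hosts file lists every host at roughly 30 bytes per line, a hypothetical figure.

```python
SECONDS_PER_DAY = 86_400


def push_load(hosts: int, changes_per_day: int, bytes_per_entry: int = 30) -> dict:
    """Back-of-envelope cost of pushing a full hosts file to every host on every change.

    Assumptions (illustrative, not measured): each change is pushed to every host,
    and each host's file lists every host at roughly `bytes_per_entry` bytes per line.
    """
    pushes_per_day = hosts * changes_per_day
    file_size = hosts * bytes_per_entry
    return {
        "pushes_per_day": pushes_per_day,
        "pushes_per_second": pushes_per_day / SECONDS_PER_DAY,
        "file_size_bytes": file_size,
        "bytes_per_day": pushes_per_day * file_size,
    }


load = push_load(hosts=100_000, changes_per_day=1_000)
```

With these inputs the sketch gives:

- 100,000,000 pushes a day, about 1,157 per second around the clock.
- A hosts file of about 3 MB.
- About 3 × 10^14 bytes (roughly 300 TB) of file content moved per day.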

There are two kinds of junior engineers. Only hire one of them. (Being very wrong is fine. Being CONFIDENTLY very wrong is not.)
This sounds a bit like saying: don't use MySQL, because it can't scale to one billion requests per second. How many applications are actually running at that scale?
Did you read the original article?

> The Facebook / Meta outage was so significant

The author specifically called out the Meta outage, as if he were offering a prescription ("It's easy to configure systems with tools like Ansible or pyinfra at scale") that would have prevented Meta (at Meta's scale) from suffering an outage. Arguing that Meta should not have used DNS, when Meta runs at exactly the scale where DNS is necessary... who comes up with these arguments?

The whole thing is nonsense. DNS is terrifically reliable; complex schemes to update it are often fragile. Replacing DNS with /etc/hosts and... a complex scheme to update it with Ansible isn't exactly a fix. The author even admits the high-profile DNS incidents weren't actually DNS servers failing.

It is pretty insane to switch from DNS servers to pushing domain config to every single client on every single update.

From TFA

>There are multiple(1) high-profile(2) incidents where DNS was involved. In these linked cases, the root-cause of the incident isn't the DNS system itself. Yet, because the root-cause affects the DNS service - which is in the critical path for virtually all services - the incident has such a huge impact.

From AWS incident report linked in TFA

>The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint
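The quoted report blames a race condition that left an empty DNS record. Below is a hedged toy illustration of how that class of bug can arise. It is not a reconstruction of AWS's DynamoDB system. The two-updater setup, the numbered "plans", and every name in it are hypothetical. A delayed updater re-activates a stale plan after a newer one went live. A cleanup step then deletes that stale plan while it is still the live one, so the published record comes back empty. The `guarded` flag shows one common mitigation: refuse to activate a plan older than the one already active.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Zone:
    """Toy DNS table for a single endpoint (hypothetical, heavily simplified)."""
    plans: Dict[int, List[str]] = field(default_factory=dict)  # generation -> IPs
    active_gen: Optional[int] = None

    def record(self) -> List[str]:
        # Published answer: IPs of the active plan, or empty if that plan was deleted.
        if self.active_gen is None:
            return []
        return self.plans.get(self.active_gen, [])


def apply_plan(zone: Zone, gen: int, guarded: bool) -> None:
    # Unguarded: blindly activate `gen`, even if a newer plan is already live.
    if guarded and zone.active_gen is not None and gen < zone.active_gen:
        return  # guarded: refuse to roll back to a stale plan
    zone.active_gen = gen


def cleanup(zone: Zone, keep_from: int) -> None:
    # Delete plans considered obsolete (generations older than keep_from).
    for gen in [g for g in zone.plans if g < keep_from]:
        del zone.plans[gen]


def simulate(guarded: bool) -> List[str]:
    zone = Zone(plans={1: ["10.0.0.1"], 2: ["10.0.0.2"]})
    apply_plan(zone, 2, guarded)  # fast updater activates the newest plan
    apply_plan(zone, 1, guarded)  # delayed updater finally activates its stale plan
    cleanup(zone, keep_from=2)    # fast updater deletes plans older than its own
    return zone.record()


simulate(guarded=False)  # -> [] (empty record: the live plan was deleted)
simulate(guarded=True)   # -> ['10.0.0.2']
```

Neither step is wrong in isolation. The empty answer comes only from the interleaving, which is why such bugs can stay latent until timing shifts.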