by necovek 18 days ago
It is not really true that DNS is for people only: it is used as an aliasing system, for load balancing, and for caching (with no cache invalidation mechanism other than ahead-of-time TTL setting).

It is used to make entire protocols work (MX records for email, but SRV records are used for much more).
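To make the SRV point a bit more concrete, here is a minimal Python sketch of how a client might choose among SRV answers: the lowest priority value wins, and weight splits traffic among records of equal priority. The `SrvRecord` type, the `pick_srv` helper and the example hostnames are made up for illustration, and the weighting is a simplification of what RFC 2782 describes (it never picks a zero-weight record when a positively weighted one is available).

```python
import random
from typing import NamedTuple, Optional, Sequence


class SrvRecord(NamedTuple):
    """Hypothetical container for the fields of one SRV answer."""
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def pick_srv(records: Sequence[SrvRecord],
             rng: Optional[random.Random] = None) -> SrvRecord:
    """Pick one record: lowest priority first, weighted random within it."""
    if not records:
        raise ValueError("no SRV records to choose from")
    if rng is None:
        rng = random.Random()
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights are zero: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


# Illustrative data only; these names are assumptions, not real services.
answers = [
    SrvRecord(10, 60, 5060, "a.example."),
    SrvRecord(10, 40, 5060, "b.example."),
    SrvRecord(20, 0, 5060, "backup.example."),
]
chosen = pick_srv(answers)
```

With this data the backup target is only ever chosen if both priority-10 records are removed from the answer, which is the kind of behaviour a flat name-to-address file cannot express on its own.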

Now, if we do look at the most basic of DNS roles (mapping a human-readable name to an arbitrary set of numbers identifying a machine on the network), we should consider how to avoid some of the issues while keeping all of the benefits of DNS.

Eg. if we indeed "materialize" machine identifiers, we lose the ability to do virtual hosting (the domain name is no longer passed in) or to fix a problem with just a DNS update (eg. treating load-balancing machines like cattle).

The author jumps immediately to arguably ill-advised materialization techniques like /etc/hosts, without considering all that DNS does for a complex, real-world system and what goes missing.


"It is not really true that DNS is for people only" Yes, "Any problem in computer science can be solved with another level of indirection... except for the problem of too many layers".

DNS is one mechanism of adding a layer of abstraction.

No disagreement there, but this layer of abstraction (mapping names to unique numeric machine identifiers) seems unavoidable for the most part: even the OP does not propose doing away with it, just replacing one tech with another (eg. DNS with /etc/hosts).

So let's not make a general argument when there are specifics to be discussed: do you have an argument for why mapping names to IDs is one abstraction too many here?

- note I was talking about internal infrastructure, not public services

- DNS load balancing is not that important for internal services in most cases. I would only use it if the alternatives won’t work.

- the virtual host issue is really addressed by /etc/hosts. I thought that was obvious; I now regret not explicitly addressing it.

The examples you cite (eg. 2021 Facebook outage) have nothing to do with DNS being used for internal infrastructure.

In the other example (Amazon DynamoDB issue), the problem is with dynamically choosing from a large dynamic pool of IP addresses for a service — DNS is but one mechanism to do it. If it wasn't DNS, it could have been something else that did that job that was broken. Even /etc/hosts if it was updated with an empty record.
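As a rough illustration of the "empty record" failure mode with /etc/hosts, here is a small Python sketch that parses hosts-format text and reports which required names would end up with no address. The function names, the `db.internal` hostname and the addresses are assumptions for the example; this is not taken from the article or from any particular tool.

```python
from typing import Dict, Iterable, List


def parse_hosts(text: str) -> Dict[str, List[str]]:
    """Map each hostname in hosts-format text to its list of addresses."""
    table: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), []).append(address)
    return table


def missing_names(text: str, required: Iterable[str]) -> List[str]:
    """Return the required names that have no address in the given text."""
    table = parse_hosts(text)
    return [name for name in required if not table.get(name.lower())]


# Hypothetical candidate files: one good, one where the name was lost.
good = "10.0.0.5 db.internal db\n"
broken = "# db entry lost during generation\n10.0.0.5\n"

problems_in_good = missing_names(good, ["db.internal"])
problems_in_broken = missing_names(broken, ["db.internal"])
```

A check like this would have to run before the file is pushed out, and the same kind of pre-publish validation applies equally to a DNS zone, which is the point: the safeguard is about the update process, not about which mechanism stores the mapping.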

What I am saying is that your analysis does not define exactly the problem you want solved, your examples do not back up your proposal or analysis, and you are ignoring all the things DNS does for both public and private infrastructure. You seem to have some intuition about this adding complexity and thus being a risk (which is true), but you need to do a better job of connecting and analysing real risks and proposed solutions (and their comparative performance).

I do state in the article that in the examples DNS isn't the root cause, but the blast radius is very significant. Regardless of the topic of external/internal services, isn't it remarkable that a group of very smart and well-paid people create such circular dependencies?

Yet, I'm not arguing for Facebook or similar-size companies to ditch DNS internally. I'm making the argument for much smaller organisations to pause and think about where their own risks lie and whether it would make sense to cut out DNS to reduce risk. Whatever process you used as an organisation to update DNS in a safe manner, you still use with the alternative solution; that doesn't change.

That said, even a broken update to /etc/hosts is probably easier and faster to recover from than a broken DNS service that everything is tied to and that, due to TTL caching, can take much longer to resolve.

As said, I believe you are simplifying the problem significantly and thus making general claims which do not hold water.

Eg. even if you are DNS based but have direct SSH access to the system which has a query cached and root access on it (you need to manage all this too!), you can temporarily edit /etc/hosts or /etc/resolv.conf to work around the cached value.

So my suggestion remains to keep working on a better argument and scenario by trying to understand exactly where your intuition applies, but be critical of yourself too, and think through whether your alternative has other cons.

By doing so, you will likely find out why everybody defaults to DNS as, in a sense, a named service registry.

Or you can just clear the cache!
> even a broken update to /etc/hosts is probably easier and faster to recover from than a broken DNS service

I fail to see how, especially if you were to accidentally break your ability to push those updates out.

I think what you are really arguing for is more people that properly understand (and implement) DNS.

A smaller organisation should have a much easier time implementing internal DNS, and it should be pretty damn stable and reliable. Unfortunately a lot of people don't properly understand it (not that you need to be a complete expert, just competent) and hence we always have the mantra "It's always DNS" when something goes down.

Usually, complicated bespoke systems engineered for internal use are better left to really large orgs that can hire the talent to maintain and properly implement them (and have the manpower to always have enough people on staff to maintain them when the first person gets sick or something).

> TTL caching

We are talking about 300 seconds (5 minutes); this is never an issue.
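For anyone who wants to see what a 300-second TTL means for a client, here is a minimal Python sketch of a TTL cache in front of a lookup function, driven by a fake clock. The `TtlCache` class, the `db.internal` name and the addresses are invented for the example, and real resolvers are more involved; it only shows that a changed record stays invisible for at most one TTL unless the cache is cleared.

```python
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache lookup results until a fixed TTL expires (illustrative only)."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float]) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: the source is not consulted
        answer = self._lookup(name)
        self._entries[name] = (answer, now + self._ttl)
        return answer

    def clear(self) -> None:
        """Drop every cached answer, forcing fresh lookups."""
        self._entries.clear()


# Hypothetical source of truth and a controllable clock.
records = {"db.internal": "10.0.0.5"}
now = [0.0]
cache = TtlCache(records.__getitem__, ttl_seconds=300, clock=lambda: now[0])

first = cache.resolve("db.internal")   # fetched at t=0, valid until t=300
records["db.internal"] = "10.0.0.9"    # the record changes at the source
now[0] = 299.0
stale = cache.resolve("db.internal")   # old answer is still served
now[0] = 300.0
fresh = cache.resolve("db.internal")   # TTL expired, new answer fetched
```

So the worst case in this model is one full TTL of staleness per caching layer, and calling `clear()` removes even that, provided you can still reach the machine holding the cache.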