by daper 1665 days ago
Of the mistakes described, two come from a lack of understanding of how exactly DNS works. But I agree it is in fact hard; see [1].

1. "This strict DNS spec enforcement will reject a CNAME record at the apex of a zone (as per RFC-2181), including the APEX of a sub-delegated subdomain. This was the reason that customers using VPN providers were disproportionately" - This is non-intuitive and many people are surprised by it. You cannot create any subdomain (even www.domain.tld) if you created "domain.tld CNAME something...". It looks like not every server/resolver enforces that restriction.

2. "based on expert advice, our understanding at the time was that DS records at the .com zone were never cached, so pulling it from the registrar would cause resolvers to immediately stop performing DNSSEC validation." - Like any other record, they can be cached. DNS also has negative caching (caching of "not found" responses). Moreover, there are resolvers that allow configuring a minimum TTL that can be higher than what your name servers return (like unbound's "cache-min-ttl" option), or that can be configured to serve stale responses in case of resolution failures after the cached data expires [2]. That means returning a TTL of "1s" will not work as you expect.

[1] https://blog.powerdns.com/2020/11/27/goodbye-dns-goodbye-pow...
[2] https://www.isc.org/blogs/2020-serve-stale/
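To make point 2 concrete, here is a minimal Python sketch of a toy resolver cache. It is only an illustration, not how unbound or BIND are actually implemented: the class `ToyResolverCache` and its parameters `min_ttl` and `stale_window` are hypothetical names standing in for a "cache-min-ttl"-style floor and a serve-stale window. Negative answers are modelled as a cached `None`; in real DNS the negative TTL is derived from the zone's SOA record, while the sketch simply takes it as an argument.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class CacheEntry:
    value: Optional[str]  # None models a cached negative ("not found") answer
    expires_at: float     # time after which the entry is no longer fresh


class ToyResolverCache:
    """Toy model of a caching resolver (hypothetical, for illustration only)."""

    def __init__(self, min_ttl: float = 0.0, stale_window: float = 0.0):
        self.min_ttl = min_ttl            # floor applied to every TTL the zone publishes
        self.stale_window = stale_window  # how long expired data may be served on failure
        self.entries: Dict[Tuple[str, str], CacheEntry] = {}

    def store(self, name: str, rtype: str, value: Optional[str],
              ttl: float, now: float) -> None:
        # The resolver, not the zone owner, has the last word on the TTL.
        effective_ttl = max(ttl, self.min_ttl)
        self.entries[(name, rtype)] = CacheEntry(value, now + effective_ttl)

    def lookup(self, name: str, rtype: str, now: float,
               upstream_ok: bool = True) -> Tuple[str, Optional[str]]:
        """Return (state, value), where state is 'hit', 'stale' or 'miss'."""
        entry = self.entries.get((name, rtype))
        if entry is None:
            return ("miss", None)
        if now < entry.expires_at:
            return ("hit", entry.value)
        # Expired: only serve it if upstream resolution is failing.
        if not upstream_ok and now < entry.expires_at + self.stale_window:
            return ("stale", entry.value)
        return ("miss", None)


if __name__ == "__main__":
    cache = ToyResolverCache(min_ttl=300, stale_window=3600)
    # The zone publishes the DS record with a TTL of 1 second.
    cache.store("example.com.", "DS", "ds-rdata", ttl=1, now=0)
    print(cache.lookup("example.com.", "DS", now=60))
    print(cache.lookup("example.com.", "DS", now=400, upstream_ok=False))
```

In this toy model the record published with a 1 second TTL is still a cache hit 60 seconds later because of the floor, and it is still served as stale data at 400 seconds if upstream resolution is failing. Removing a record at the source does nothing to copies already held this way.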


My (basic and conservative) mental model that "in DNS, everything including the lack of presence of a thing can be cached" is why I'm very cautious before rolling out anything from DKIM to DNSSEC. A deep understanding of specifications is vital. I'm somewhat surprised an organization of Slack's scale didn't have a consultant on the level of "I designed DNSSEC" on hand for this.

DNS is a bit like network engineering, in that simple errors tend to have large impacts, which rules out trial and error. Before working as a sysadmin I thought that doing experimental lab setups was something only researchers and students did, but when you have an old system up and running, it can be quite difficult to get in there and make changes unless you are very sure about what you are doing.

As in networking, there can also be existing protocol errors and plainly broken things that have, for one reason or another, seemingly been working for decades without causing a problem. Internet flag day is one of those things that pokes at those problems, and maybe one day we will see a test for CNAME at the apex.
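As a rough idea of what such a test could look like, here is a small Python sketch that scans a list of (owner name, record type) pairs and reports every name where a CNAME coexists with other data. The function name `find_cname_conflicts` and the input format are assumptions made for illustration, not part of any real tool. It only checks same-name coexistence, and it lets RRSIG and NSEC through because DNSSEC permits those alongside a CNAME. A zone apex always carries SOA and NS records, so a CNAME there is always flagged.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Record types that DNSSEC allows at the same owner name as a CNAME.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def find_cname_conflicts(
    records: Iterable[Tuple[str, str]]
) -> List[Tuple[str, List[str]]]:
    """Return (name, conflicting_types) for names where CNAME coexists with other data.

    `records` is an iterable of (owner_name, record_type) pairs; this input
    format is a simplification chosen for the sketch.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Owner names are case-insensitive; drop the trailing root dot.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())

    conflicts = []
    for name, types in types_by_name.items():
        others = types - ALLOWED_WITH_CNAME
        if "CNAME" in types and others:
            conflicts.append((name, sorted(others)))
    return sorted(conflicts)


if __name__ == "__main__":
    zone = [
        ("example.com.", "SOA"),
        ("example.com.", "NS"),
        ("example.com.", "CNAME"),      # CNAME at the apex: conflicts with SOA/NS
        ("www.example.com.", "CNAME"),  # fine: nothing else at this name
    ]
    print(find_cname_conflicts(zone))
```

A check like this is cheap to run against a zone file before publishing, which is the kind of lab-style testing that is hard to do once the records are live and cached.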

It's worth noting that this by itself is a reason not to do ambitious security things (and a global PKI is nothing if not ambitious) at the layer of DNS. It's an extension of the end-to-end argument, or at least of the logic used in the Saltzer and Reed paper: because it's difficult and error-prone to deploy policy code in the core of the network (here: the "conceptual" core of the protocol stack), we should work to get that policy further up the stack and closer to the applications that actually care about that policy.

The Saltzer and Reed paper, if I'm remembering right, even calls out security as specifically one of those things you don't want to be doing in the middle of the network.

See also: Zero Trust / BeyondCorp.

When people start to implement security at the BGP layer, which will likely occur some time soon, we will see things break. We will also see BGP fail if we don't do anything, as the protocol is ancient, has an untold amount of undefined behavior between different devices and suppliers, and is extremely fragile.

Many have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative. DNS, BGP, IP, UDP, TCP, and HTTP, to name a few, are seeing incremental changes, and the cost is preferable to the alternative of doing nothing. Ambitious security things would be much less costly if we had working redundancy in place, which is one of those things that flag days tend to illustrate. With good redundancy, people won't notice when HTTP becomes HTTP/2 and later HTTP/3. It also helped development at Google that when they added QUIC, they controlled both ends of the connection.

> Many have suggested that we should just scrap the whole thing called The Internet and start from scratch. It would be safer, but I don't think it is a serious alternative.

See second-system effect:

> https://en.wikipedia.org/wiki/Second-system_effect

Yep - in this, as in many things in life, expert knowledge is knowing what experiments and tests you should be doing as much as which ones you can avoid.

> I'm somewhat surprised an organization of Slack's scale didn't have a consultant on the level of "I designed DNSSEC" on hand for this

If it takes a designer of DNSSEC to implement it, then how should I, a peasant, implement DNSSEC for my infra?