by keville 1718 days ago
https://lists.dns-oarc.net/pipermail/dns-operations/2021-Sep...
3 comments

Holy shit. That's bad.

What this suggests is that Slack, for reasons passing understanding, enabled DNSSEC on their zones (with a DS record that essentially turns DNSSEC on, and the accompanying key records) --- then disabled DNSSEC by pulling all the records. But the DS records are in caches; validating resolvers go looking for the keys, which don't exist, and say "welp, I guess Slack.com doesn't exist".
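A toy model of that failure mode may help. This is only a sketch, not how any real resolver is implemented: the `ToyResolver` class, its method names, and the TTL values are all made up for illustration. It also reports the failure as SERVFAIL, which is what validating resolvers typically return in this situation rather than a literal "does not exist".

```python
from dataclasses import dataclass, field


@dataclass
class ToyResolver:
    """Hypothetical validating resolver that caches DS records until their TTL expires."""

    ds_cache: dict = field(default_factory=dict)  # zone name -> expiry time in seconds

    def cache_ds(self, zone, now, ttl):
        # Remember that the parent zone said "this child zone is signed".
        self.ds_cache[zone] = now + ttl

    def resolve(self, zone, now, zone_has_dnskey):
        expiry = self.ds_cache.get(zone)
        ds_cached = expiry is not None and now < expiry
        if ds_cached and not zone_has_dnskey:
            # Cached DS says the zone is signed, but the keys are gone:
            # validation fails and the answer is rejected.
            return "SERVFAIL"
        return "NOERROR"


resolver = ToyResolver()
resolver.cache_ds("slack.com", now=0, ttl=24 * 3600)  # assumed 24h DS TTL

# One hour after the keys were pulled, the cached DS still forces validation.
print(resolver.resolve("slack.com", now=3600, zone_has_dnskey=False))
# After the cached DS has expired, the zone resolves as an ordinary unsigned zone.
print(resolver.resolve("slack.com", now=25 * 3600, zone_has_dnskey=False))
```

The point of the sketch is that nothing the zone owner does on their own servers can fix the first case; only cache expiry (or a manual flush) on each resolver does.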

I wonder if they are using tooling that doesn't properly retain DNSKEY records for DS records that were recently removed? This is one of the reasons we perform controlled automated key rotation and removal in DNSimple, so that we can ensure we retain the keys in the authoritative zone on each key rollover, giving the DS records time to expire from caches.
It was especially bad since their status page wouldn't even resolve! I eventually just restarted my local caching DNS server.
We had a DNS related outage with route53. Some of our zones just lost some records and then they reappeared. Could that explain what happened to slack's DNSSEC related records?
A good question, and apparently enough to elicit a response: No!

> This issue was caused by our own change and not related to any third-party DNS software and services.

Aren't you, in fact, the same Thomas Ptacek who has repeatedly claimed that DNSSEC is so irrelevant that events like this would go essentially unnoticed?

Edited to add, e.g. https://news.ycombinator.com/item?id=22400167

> DNSSEC is moribund and almost nobody uses it; in reality, the DNSSEC root private keys could land on Pastebin tomorrow and nothing would "break"

We have this whole thread here about a "service disruption" for Slack, and nobody leaked the "root private keys"; just one person made a dumb error and it blew up their site.

No, I'm the Thomas Ptacek who has repeatedly claimed that the only impact DNSSEC is going to have on the Internet is causing outages like this. It's right there in the blog posts; in fact, it's even in the 2007 blog posts I wrote about this on the Matasano blog.
> just one person made a dumb error and it blew up their site

yeah, the dumb error they made was "using DNSSEC"

I'm not going to defend DNSSEC here, because this outage and others continue to support tptacek's perspective on its usefulness.

But, some governments are requiring DNSSEC, which regardless of its usefulness, puts companies that want those contracts in a bit of a bind.

Perhaps it would make sense to split domains such that DNSSEC guarded ones would not negatively impact ones that do not have DNSSEC.

The USG DNSSEC requirements, which seem to be a part of what happened, are fragmented and incoherent. OMB withdrew DNSSEC requirements in 2018, and CLOUD.GOV doesn't support it. But some older requirements documents still have them, and need to be updated.

The important top-line thing to know here is that virtually all tech companies eschew DNSSEC (you can verify that for yourself with `host -t ds stripe.com`; substitute any other company for Stripe).

DNSSEC-quarantine TLDs are a good idea.

If anyone else is curious about the OMB cycle, here's a pretty good explanation with links to the source memos:

https://cloud.gov/docs/compliance/domain-standards/#dnssec

Huh, TIL:

> Both Google and Cloudflare have a publicly accessible feature to flush the cache for a domain, so anyone could have done it:
> https://developers.google.com/speed/public-dns/cache
> https://1.1.1.1/purge-cache/

Quite useful feature indeed.

Unlikely to help in this particular case (which is a root-level DS record).
It did help, since it also flushes the cached DS record. Just try making a DNS query using those resolvers.
Awesome handy tip, thanks!
https://dnsviz.net/d/slack.com/YVXX_g/dnssec/ is the dnsviz analysis showing the slack.com zone DNSKEY existing at 12:55, followed by the .com zone DS record at 15:30. However, the next analysis at 17:24 shows both the .com zone DS and slack.com DNSKEY records have disappeared!

Given that the slack.com DNSKEY shows up with a 1h TTL and the .com zone DS has a 24h TTL, they are screwed in the presence of cached slack.com DS records from the .com zone. Do not throw away your DNSKEY until your delegation's TTL has absolutely positively surely expired from any resolver caches!
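The arithmetic behind that rule can be sketched in a few lines. This is an illustration only: the function name, the one-hour safety margin, and the example timestamp are assumptions, not anything Slack or the registry published. Only the 24h DS TTL comes from the dnsviz numbers above.

```python
from datetime import datetime, timedelta


def earliest_safe_dnskey_removal(ds_removed_at, ds_ttl, margin=timedelta(hours=1)):
    """Return the earliest time at which no resolver should still hold the old DS.

    A resolver may have cached the DS just before it was removed from the
    parent zone, so the DNSKEY has to stay published for a full DS TTL after
    that moment. The margin is an arbitrary, assumed safety buffer.
    """
    return ds_removed_at + ds_ttl + margin


# Hypothetical removal time; the 24h TTL matches the .com DS TTL quoted above.
ds_removed_at = datetime(2021, 9, 30, 16, 0)
keep_keys_until = earliest_safe_dnskey_removal(ds_removed_at, timedelta(hours=24))
print(keep_keys_until.isoformat())
```

Note that the DNSKEY's own 1h TTL does not enter the calculation at all; the longer TTL on the parent's DS record is what sets the waiting period.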

The slack.com domain is an AWS Route 53 zone. I'd be really interested to see a post-mortem explaining what happened here. Are they unable to recover the KSK/ZSK and restore the DNSKEY/etc records?

Great analysis, thanks!

Slack support says that users should tell their ISPs to invalidate the DNS cache for slack.com https://status.slack.com/2021-09/06c1e17de93e7dc2 (access with 8.8.8.8 as resolver - fallback https://slack-status.azureedge.net/)

Since the faulty DS record was in .com, everyone has a max wait-for-ttl-to-expire time of 24h.

Google/Cloudflare etc. also seem to have invalidated their cached .com data very quickly; switching to 8.8.8.8 was the first quick workaround.

Meanwhile, 14 hours later, DTAG in Germany still does not resolve. The default resolvers have dnssec enabled.

dig slack.com +cd

tells the resolver to skip dnssec validation tests, and then it works again. Screenshots with the command output in https://twitter.com/dnsmichi/status/1443840645513293853?s=2
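For the curious, `+cd` corresponds to the CD ("checking disabled") bit in the DNS message header. Here is a minimal sketch of what such a query looks like on the wire, built with only the standard library. The `build_query` helper and the fixed query ID are made up for illustration, the packet is never sent anywhere, and real `dig` may set additional flags or add EDNS options that are left out here.

```python
import struct

RD = 0x0100  # "recursion desired" flag bit in the DNS header
CD = 0x0010  # "checking disabled" flag bit: ask the resolver to skip DNSSEC validation


def build_query(name, qtype=1, checking_disabled=True, query_id=0x1234):
    """Build a bare DNS query packet (hypothetical helper; qtype 1 is an A record)."""
    flags = RD | (CD if checking_disabled else 0)
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # Question trailer: QTYPE, then QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", qtype, 1)


packet = build_query("slack.com")
flags = struct.unpack("!H", packet[2:4])[0]
print("CD bit set:", bool(flags & CD))
```

A validating resolver that honors the bit returns whatever answer it gets without running the DNSSEC checks, which is why the lookup succeeds even though the cached DS no longer matches anything in the zone.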

Very interested in the post-mortem analysis. I think there were similar mistakes to those in the nasa.gov incident, which Comcast analyzed in 2012: https://www.internetsociety.org/blog/2012/01/comcast-release...

Learnings for me:

- dnstracer (https://gitlab.com/dnsmichi/dotfiles/-/blob/main/Brewfile#L5...) helps with detecting missing glue records, but not dnssec

- dnstrace (https://github.com/rs/dnstrace) is a better alternative with dnssec