by Joe8Bit 1666 days ago
I know we’ve all collectively accepted that DNSSEC is a terrible, complicated blight on the world, but I still find it incredible that an organisation with Slack's resources and access to expertise can’t make it work.

You say Slack, and I agree, that's telling, but you have to add to that AWS itself, which had a DNSSEC bug in its wildcard record support as well. Slack and AWS together couldn't make this feature work. Further: the open source tooling Slack (like most places) relies on for deployment is also DNSSEC-hostile: one of their problems is that Terraform's Route53 provider doesn't safely disable DNSSEC once enabled. It's a mess everywhere you look.

I think another interesting question here is why Slack bothered in the first place. As was pointed out on the other DNSSEC thread today: practically nobody in the technology industry uses DNSSEC in the first place. Presumably, Slack did DNSSEC (they don't anymore!) in service of FedRAMP compliance. Why? Slack has one of the most popular products in all of computing. What bad thing was going to happen if they said "nah, we're going to go with Cloud.gov's recommendation and not this FedRAMP document"?

> Presumably, Slack did DNSSEC (they don't anymore!) in service of FedRAMP compliance. Why? Slack has one of the most popular products in all of computing. What bad thing was going to happen if they said "nah, we're going to go with Cloud.gov's recommendation and not this FedRAMP document"?

As just one example, it's tremendously difficult, if not impossible, to sell your cloud-based SaaS to Navy customers if you have open FedRAMP compliance issues that you aren't at least working to address.

I say "compliance" instead of "security" for a reason as well, as "compliance" truly runs the show in Navy cybersecurity. And if you want to sell to that market (and it's hardly just Navy who runs this way), it's easier to check the checkboxes than it is to argue about whether NIST is right or cloud.gov is right.

Gotta be FedRAMP compliant to do business with the US government. Even worse, you have to be FedRAMP compliant to work with anyone who works with the US government. From a business (if not an engineering) standpoint, there's plenty to gain in going through the motions.

As was pointed out downthread, there are tech companies that are "more" FedRAMP compliant (FedRAMP "High") without DNSSEC support.

(Kenn White points out on Twitter that some of this may be due to grandfathering --- though, the FedRAMP DNSSEC requirement is pretty old.)

I don't know about FedRAMP, but with other government requirements, the easiest way to get an exception was to fail badly at implementing the requirement.

When the DOD tried to mandate Ada, lots of projects were bid as Ada, then switched to C++ at the very first sign of any trouble whatsoever. I would 100% believe it if someone told me that this horrible rollout could be leveraged into an exemption from needing DNSSEC.

We had to do DNSSEC (for a couple of "system relevant" services) too.

Was it a hard requirement? No, but the fat fingered audit companies really like to tick that "should" box green and would be more lenient with other debatable findings, so it was suddenly "in our best interests" to comply.

It's a business decision. Good luck selling software subscriptions to federal agencies without FedRAMP compliance.

I'm pretty surprised that Slack doesn't have a more robust testing network. Is it really that hard to set up another DNS on Route53 for staging these changes? Idk, but that type of thing is the least you can do if you want some FBI agents to discuss active investigations on your chat platform...

None of this, none of it at all, has anything to do with Slack's ability to safely host conversations from the FBI. Whatever challenges they have with that are entirely orthogonal to this stupid performative stunt DNS configuration.

(There's a whole thread here, and more on Twitter, getting into the actual details of what FedRAMP and NIST require here, and engaging with the fact that Slack is the only large tech company in the past several years to have attempted to flip the DNSSEC switch on.)

I work for a company that maintains DNSSEC on our FedRAMP deployment. It's not unreasonable to ask for signed DNS records if feds are going to hit them.

Your blog post supposes that DNSSEC is only being pushed as an alternative to CAs as a means of securing TLS. While it makes a compelling case that this isn't realistic, there are other security concerns that arise from the compromise of DNS records. If the government is going to use a DNS record, it should be signed by a zone owner.

Slack is actually a good use case for this security enforcement, because they maintain a handful of domains that are extremely authoritative for their messaging service[1]. If you can't maintain a security protocol on four domains that are crucial to the operation of your service, you maybe aren't cut out to supply software for the government.

1: https://slack.com/help/articles/360001603387-Manage-Slack-co...

I've done security work for products deployed at DOD and in other sensitive agencies, and had firsthand experience with USG infosec, and the idea that the USG sets any kind of useful standard for infrastructure security is risible.

Unfortunately, the GSA product market is its own bubble, as are the people who work in IT for the USG in any capacity, and so it's easy to see how people with limited exposure to modern industry practice --- experiences almost wholly gated through vendors that snake through the GSA acquisition process --- might believe themselves to be operating several levels above where they actually are.

I would take Slack's security practice --- their infrasec, their corpsec, their software security, the whole shebang --- over anything done in any USG agency. Slack is better at this than their USG clients are, full stop. And Slack, while strong, is far from the S tier of industry security teams.

Well, the Slack security team seems to think that DNSSEC is important. Even for their workspace domains.

I just want to hammer home the point that requiring service providers to get their DNS records signed by DNS zone owners is a reasonable ask for USG software service vendors. Even if DNSSEC isn't capable of securing the whole internet.

Because FedRAMP compliance is required for many US federal (and now some state) customers, whom Slack can charge a premium.

No tech company is infallible. All of them have outages, some lasting hours, even days.

Complex systems can and will fail. Try to do better, of course, but let’s acknowledge that perfection will always exceed our grasp. The world will continue to turn regardless.

One day it might just be your turn to break production.

The subtext here isn't that Slack is bad at this (they are not), but that DNSSEC is somehow intrinsically unsafe (it probably is).

I agree with your points about DNSSEC (disclaimer: I have not had the pleasure of having to implement it myself in infra), but was attempting to communicate that DNSSEC isn’t the only area of ops that folks get exposed to these sorts of unknowns or edge cases, and that no amount of resourcing enables you to avoid these issues. For Slack, it was DNSSEC. For Roblox, Consul. Facebook/Insta, software defined BGP. Akamai, DNS.

Perhaps I did not read the room appropriately. Mea culpa.

Did Roblox finally come out with their postmortem blaming Consul? As far as I know we just assumed it, but have had no update since October.

"It turned out that some resolvers become more strict when DNSSEC signing is enabled at the authoritative name servers, even while signing was not enabled at the root name servers (i.e. before DS records were published to COM nameservers). This strict DNS spec enforcement will reject a CNAME record at the apex of a zone (as per RFC-2181), including the APEX of a sub-delegated subdomain"

Slack's second attempt wasn't a DNSSEC problem. Slack depended on the permissive fallback of resolvers when encountering a plain DNS protocol error. It is similar to how some websites in the past relied on permissive browser implementations when facing broken HTML/JS/CSS. Slack fixed their broken DNS as a result of this.
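To make the rule in that quote concrete: a CNAME can't share an owner name with other data, and a zone apex always carries SOA and NS records, so an apex CNAME is always a conflict (DNSSEC's own RRSIG and NSEC records are the exception that may sit next to a CNAME). Here is a rough sketch in Python of a pre-deployment check along those lines. It assumes the zone is available as (name, type) pairs; the function names and the `example.test` zone are made up for illustration and are not Slack's or Route53's tooling.

```python
from collections import defaultdict

# Record types that DNSSEC permits alongside a CNAME at the same owner name.
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def normalize(name: str) -> str:
    """Lowercase a DNS name and strip the trailing dot for comparison."""
    return name.rstrip(".").lower()


def cname_conflicts(records):
    """Return {owner_name: sorted conflicting types} for every name where a
    CNAME coexists with other data. `records` is an iterable of
    (name, rtype) pairs."""
    types_by_name = defaultdict(set)
    for name, rtype in records:
        types_by_name[normalize(name)].add(rtype.upper())

    conflicts = {}
    for name, types in types_by_name.items():
        if "CNAME" in types:
            others = types - ALLOWED_WITH_CNAME
            if others:
                conflicts[name] = sorted(others)
    return conflicts


# Hypothetical zone data, for illustration only.
zone = [
    ("example.test.", "SOA"),
    ("example.test.", "NS"),
    ("example.test.", "CNAME"),      # apex CNAME: conflicts with SOA and NS
    ("www.example.test.", "CNAME"),  # fine: nothing else at this name
    ("www.example.test.", "RRSIG"),  # fine: signature covering the CNAME
]

print(cname_conflicts(zone))  # {'example.test': ['NS', 'SOA']}
```

Running something like this against a staging copy of the zone would flag the apex CNAME before any resolver got the chance to be strict about it.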

Slack's third attempt was not the fault of Slack but rather a software bug at Amazon. I would make the argument that Amazon's primary product isn't DNS services, but they did fix their bug after this.

The general conclusion I get from the article is not that DNSSEC is broken, nor that it is too complicated. It is that when making changes to your core infrastructure to make it more secure, bugs that may have been lying dormant might pop up and bite. I am sure some people have had that experience in domains outside of DNS.

You are not wrong, but by steering clear of DNSSEC, Slack would not have had the outage they did.

What one can't ignore is the underlying chicken-and-egg problem that DNSSEC must overcome: there are not many DNSSEC deployments, so not much of it has been tested in the real world, which results in colossal outages despite the attention of some of the most qualified engineers, including the ones running one of the largest nameserver deployments in the world.

TLS and WebPKI have had a similar, perhaps even more painful route to ubiquity, so this problem isn't unique to DNSSEC. What isn't working in DNSSEC's favour is that the world has not just moved on, but has built solutions atop DNS's weaknesses, like it once did with IPv4 and NAT. The Internet's strong network effects, coupled with its heterogeneity, make battling "the System" an even harder proposition.

See also: System design explains the world: Vol 1, https://apenwarr.ca/log/20201227

I know HN has collectively accepted this, but every time I'm associated with an organisation that pays for a penetration test, the lack of DNSSEC comes in as a high-risk finding, so much so that I've given in to deploying it to avoid sitting with non-technical managers doing the "here's why I disagree" all over again. Outside of this group I definitely feel like I'm on my own in that view.