> People don't believe me when I say how much DNS matters.
That's weird to me. I have been working in sysadmin/DevOps for over a decade, but it did not take me very long to learn that DNS outages cause massive problems.
Right, but everybody has to learn that at some point. And I happen to be somebody who teaches such things. The importance of DNS is hard to overstate, but I go to great lengths to do exactly that, to make a point ;)
Thank you! I'm glad that landed where I wanted it to. It was a lot of fun to put together. I keep threatening to make a video. I need a collection of DNS memes so that I can just slideshow them.
Amazing that Downdetector manages to stay up during these kinds of outages. I noticed it has been a little slow, but they really have done a good job keeping it up even though large portions of the internet are down right now.
It's interesting that they report an AWS outage but there don't seem to be any issues there. Looks like their methodology is a bit too reliant on those speculative tweets from the first 5 minutes of all these sites going down. https://downdetector.com/status/aws-amazon-web-services/
> So many websites are down, are AWS servers down or something?
> Amazon web services is down which is affecting a lot of company web sites and services. Not sure what is going on.
> Miss us? @aldotcom and a whole bunch of other folks have been knocked off the internet by what appears to be an AWS attack/system failure. We'll be back.
Yep that's my point. I'm guessing that for a lot of sites they can verify if there's an outage pretty easily when they see a spike in reports, but for something like AWS unless they updated their status page (lol) or downdetector ran a bunch of stuff on there just to check with, I guess they don't have a good way to verify it.
Gotcha, yeah I guess I always just considered that out of scope for their service and that it’s just a report aggregator but I suppose you would expect it to be at least a little bit clever based on the “detector” name
You could run a local resolver like dnsmasq or Unbound that can “serve stale” on upstream failures, but that assumes the DNS failure is a client-facing resolver one.
From what I observed here, it was more internal DNS related: Newegg was serving an opaque “DNS failure” error page from Akamai’s front-end which is likely because their infra was failing to resolve names internally.
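For anyone curious what "serve stale" means in practice, here is a rough sketch of the idea rather than how dnsmasq or Unbound actually implement it. The `StaleCache` class, the `upstream` callable and the `max_stale` window are all made up for illustration.

```python
import time

class StaleCache:
    """Tiny resolver cache that can answer from expired entries when upstream fails."""

    def __init__(self, upstream, max_stale=86400, clock=time.monotonic):
        # upstream is a hypothetical callable: name -> (address, ttl_seconds).
        # It is assumed to raise OSError when the upstream resolver is unreachable.
        self.upstream = upstream
        self.max_stale = max_stale   # how long past expiry an entry may still be served
        self.clock = clock
        self.entries = {}            # name -> (address, expires_at)

    def resolve(self, name):
        now = self.clock()
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            return cached[0]         # fresh cache hit
        try:
            address, ttl = self.upstream(name)
        except OSError:
            # Upstream failed: serve the expired entry if it is not too old.
            if cached and now < cached[1] + self.max_stale:
                return cached[0]
            raise
        self.entries[name] = (address, now + ttl)
        return address
```

As noted above, this only helps when the failure sits between you and your resolver's upstream. It does nothing if the site's own infrastructure cannot resolve its internal names.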
Just got booted out of Netflix on the PS4 because the console could no longer connect to Sony's license server. Netflix was working just fine by the way.
Ah, that's what's going on. Happened to me as well. I just assumed that Sony was neglecting PS4 performance with its new system while bogging it down with bloatware.
Their APIs are (or were, last I suffered their use a few years ago) also terrible, e.g. a blanket policy of refusing to cache any resource in the presence of a "Vary" header, regardless of its value, and failure to honor standard HTTP headers... thankfully there are many other options for CDNs, which are SO MUCH BETTER.
Akamai is their own worst enemy most of the time. Their prices are the highest, they trail on features, their documentation is opaque, it takes an hour to propagate changes, etc. Only a few years ago you could only use SSL if you purchased their ridiculously expensive PCI DSS plan - I thought they would defend that to their grave.
Better alternatives are Cloudflare, Fastly, AWS CloudFront.
Google Cloud CDN always seems to have very good latency but a very bare bones feature set and no edge compute I can identify. Support is always a huge red mark for Google anything.
> AMD automatically strips these headers out of requests to support caching for faster delivery.
> I need the Vary HTTP headers: AMD can cache the associated object if the Vary HTTP header contains only "Accept-Encoding" and "Gzip" is present in the Content-Encoding header
(AMD in this case standing for Akamai Media Delivery)
It wasn't that simple — IIRC, for a while Vary meant “don't cache anything, ever, under any circumstances” unless you made some custom configuration changes. Over time they _added_ support for just “Vary: Accept-Encoding” (IIRC less than a decade ago) and that was fragile. They improved that over time but it was painful for a number of years because there were various failure modes which meant things wouldn't be cached, or (IIRC) compression would be disabled for certain URLs sporadically if the first request for the option did not request transfer compression.
yeah, but only tech nerds see it, so it's okay. maybe it's a ploy to get the users to go to the real command set via CLI. make it so shitty nobody wants the UI, and goes back to the terminal. "if you're not a CLI ninja, then you shouldn't be using our product anyways!"
What's frustrating is that DNS is returning an address, instead of just failing, and so macos is caching that value (though it might be cloudflare doing that).
Wildcard DNS should be a prosecutable crime, punishable by no less than 20 years of hard labor.
(Edit: Probably should have made it clear that this was a joke)
Presumably you're referring to the practice of answering queries for nonexistent records with an A record belonging to an advertisement page? (instead of doing the right thing answering NXDOMAIN, presuming no records of another type also exist for the queried name.)
dnsmasq has a really useful feature for dealing with this: --bogus-nxdomain
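The idea behind that option, as I understand it, is to turn any reply containing a known ad-page address back into "no such domain". A minimal sketch of that filtering step, not dnsmasq's code; the function name is hypothetical and the addresses used with it would be whatever your ISP hands out for nonexistent names:

```python
import ipaddress

def apply_bogus_nxdomain(answers, bogus_addresses):
    """Return None (standing in for NXDOMAIN) if any answer is a known bogus address."""
    bogus = {ipaddress.ip_address(a) for a in bogus_addresses}
    if any(ipaddress.ip_address(a) in bogus for a in answers):
        return None              # pretend the name does not exist
    return list(answers)         # legitimate reply, pass it through unchanged
```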
I wonder if this is why LastPass is down. It has completely locked me out of my vault. You'd think it'd continue to work offline in a case like this. :/
Same path. It'll be very hard to move away from 1Password. App experience, sync, security features like key in addition to master password, family organizer-based recovery of an account, these are a few things that stand out.
Yeah, I use 1Password for every critical bit of information (SSNs, physical access codes) and a whole lot of less-critical stuff. I expect to be a customer for life.
That's about right for what it is, or at least how I think about it. There's no magic "unlock vault" button (by design), but an Organizer can kick off a workflow to reset a vault if need be. I have a few of the more tech-savvy family members set as organizers in my family in case something ever happens to me.
My favorite feature personally is the built-in 2FA support. Click and it logs into your account and copies the 2fa code to clipboard so just paste on next screen.
Multiple vaults too is nice but I know others have ways to limit exposure of passwords in similar manners.
Bitwarden offers this as well, but I don't really understand why you would want it. If someone compromises your password manager, 2FA is now worthless. Or am I misunderstanding how it works?
I prefer the browser addon for Bitwarden over 1Password. Try editing a site in 1Password. It forces you to log into the full site, whereas Bitwarden can do almost everything right there in the addon.
This is also possible with the 1Password X extension, however there's a lot of feature segmentation and unclear messaging between the Desktop app-based version and 1Password X so I don't blame you for using the old one.
So, NS entries pointing to both? But then take the example your domain was in Route53 and AWS goes down. You can't configure the NS entries to avoid AWS DNS servers. Is the idea that child DNS servers detect the outage and cache the values in the name server(s) that remain up?
But then, the cached values from AWS take a while to clear, TTL never seems to be applied properly. It always feels like the worst case in such a scenario is you can point everyone at the right thing within 24 hours.
Configuring two NS entries is pretty standard, so surely most resolvers try one of the two, and if it's down try the other one? What else would be the point of having multiple nameservers? Then you just have to get two nameserver providers and make sure their settings stay synced, and point your domain to one nameserver from each.
Of course that requires the server to properly fail, i.e. stop responding to requests. That doesn't seem to be the case here
You set both services in your ns records. So every day they share the load for dns resolution. If one day one of them is down the client can/will use a different nameserver from your configuration.
DNS is fastest first* rather than main/failover. If AWS DNS was down your GCP DNS would have replied (if all is well) sooner than {timeout} so your visitor would still have a response
* Sort of. I think if the client doesn't get a reply from the server it picked randomly in 1s they move on to the next server, repeat until all fail
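Roughly, the behaviour described above looks like the sketch below. This is an illustration under assumptions, not any particular resolver's logic: `query` is a hypothetical callable standing in for a real DNS lookup, and the 1 second timeout is just the figure guessed at above.

```python
import random

def resolve_with_failover(name, nameservers, query, timeout=1.0, rng=random):
    """Try each nameserver in random order; move on when one times out."""
    order = list(nameservers)
    rng.shuffle(order)           # no primary: every NS record is an equal candidate
    failed = []
    for server in order:
        try:
            # query(server, name, timeout) is assumed to raise TimeoutError on no reply.
            return query(server, name, timeout)
        except TimeoutError:
            failed.append(server)
    raise TimeoutError(f"all nameservers failed for {name}: {failed}")
```

Note that this only helps if the broken server actually stops answering. A server that keeps replying with wrong data never triggers the timeout path.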
I think if Route53 was down, your DNS provider wouldn't be able to reach it. So it would go to the root, which would give it the GCP one too, and your DNS provider might try that.
(I don't know if this is how it works, but I think that's how it's supposed to work)
You typically have four name servers for a domain, but they don’t all have to be hosted with the same company. Very handy when your DNS provider decides to brag they are unhackable and the hackers reply by immediately hacking them followed by DDoSing them to death.
gov.uk's traffic seems to be handled by Fastly, a well known CDN.
What I'm a bit surprised / unsure of is what happens when I run "dig ns gov.uk". The results are:
gov.uk. 21559 IN NS ns1.surfnet.nl.
gov.uk. 21559 IN NS auth50.ns.de.uu.net.
gov.uk. 21559 IN NS ns3.ja.net.
gov.uk. 21559 IN NS ns2.ja.net.
gov.uk. 21559 IN NS ns0.ja.net.
gov.uk. 21559 IN NS auth00.ns.de.uu.net.
gov.uk. 21559 IN NS ns4.ja.net.
Who are ja.net, uu.net and surfnet.nl?
EDIT: I see that ja.net i.e. jisc.ac.uk "manages the second level domain .gov.uk" -- https://www.jisc.ac.uk/domain-registry . I imagine that uu.net and surfnet.nl are there for redundancy
Ah sorry, you're indeed right. Turns out it was just the .service.gov.uk domain that uses GCP and AWS - I just thought that applied to the parent domain too.
Is it possible to see if/where gov.uk is using GCP or AWS for its domain zones? From what I can see, that's not the case? Or am I looking in the wrong place?
Last time I tried setting NS to both cloudflare and digital ocean in my domain registry, cloudflare sent me an email saying the configuration is invalid and asked me to revert. Am I doing something wrong?
No, you have done everything right. At least from the point of view of DNS. That you can not use multiple nameservers is a limitation of Cloudflare (limit in the sense of: Cloudflare can only offer their services in the Free and Pro plan if they have full control over all nameservers).
It is relatively easy to make DNS highly redundant: just put multiple DNS servers in data centers which are as independent as possible (different geo locations, different ISPs). You can also use different DNS software and different OSes (say BSD+Linux) to exclude correlated bugs. Root DNS servers AFAIK use different software for this reason.
Problems start when you want to easily make frequent changes and introduce complex software to manage DNS zones (and complexity usually comes with bugs).
The problem isn't DNS though, is it? The problem is that people don't necessarily use the redundancies on DNS?
The whole reason it takes a domain 24h to fully work with DNS is because it propagates the information to other DNS servers, thus making it not a centralized service.
DNS doesn't 'propagate' except in the very limited case of zone-transfer publication, which... nobody really relies on these days. Registrars tell you it takes 24 hours to propagate to stop you from complaining to them about your ISP's DNS caching policy. The reality is: recursing DNS servers have caches, they respect TTLs, and for the most part that means that DNS changes should fully wash through within an hour for most changes (less if you keep your TTLs shorter).
It's an interesting question, as it's always been solved on the server side. All of the current problem is client side. That is, client resolvers that aren't using diverse providers, and only do things like round-robin with long timeouts.
From a client (DNS recursor) point of view there is no primary server. There are just multiple NS records which are equal. If one of them is down it can introduce resolving delays, but they are usually small. At least if something like Unbound or BIND is used. Unbound, e.g., maintains an infra-cache where it tracks RTT and errors for each server and avoids servers which are down.
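A toy version of that per-server bookkeeping might look like the following. To be clear, this is not Unbound's actual algorithm; the class name, the smoothing factor and the timeout penalty are all assumptions picked for illustration.

```python
class ServerStats:
    """Track a smoothed RTT per nameserver and prefer the fastest one."""

    def __init__(self, servers, alpha=0.25, penalty_ms=10000.0):
        self.alpha = alpha              # weight given to the newest RTT sample
        self.penalty_ms = penalty_ms    # RTT assigned to a server that timed out
        self.rtt = {s: None for s in servers}   # None means "not tried yet"

    def record_reply(self, server, rtt_ms):
        old = self.rtt[server]
        if old is None:
            self.rtt[server] = rtt_ms
        else:
            self.rtt[server] = (1 - self.alpha) * old + self.alpha * rtt_ms

    def record_timeout(self, server):
        # Push the server to the back of the queue instead of removing it,
        # so it can recover once it starts answering again.
        self.rtt[server] = self.penalty_ms

    def pick(self):
        # Untried servers sort first so every server gets probed at least once.
        return min(self.rtt, key=lambda s: -1.0 if self.rtt[s] is None else self.rtt[s])
```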
Decentralized control of a centralized finite resource (domain names) requires consensus. For example, Joe Smith and Joe Blow both want joe.com.
You want a protocol that gives consistent "global" state without any centralized / trusted users - blockchain/bitcoin is one of the only technical solutions to provide that.
I agree that it's a garbage solution in practice, but that's why it's got cryptoshit bundled in.
A potential different solution to DNS monopoly, if that is a problem that needs solving, is multiple name-resolution providers that have differing records on what name points where. (The tradeoff is that an owner may need to register their name with multiple different providers).
Agreed. Blockchain is a convoluted solution, but it’s a solution for distributed consensus, if one feels that’s required. But in general I would argue the current root system has served us well and is open and free.
The world you describe, effectively with multiple roots, is coming. Russia have a switch (they’ve even tested it), to anycast out the root DNS IPs within the country, and block them externally. In theory this doesn’t make another “internet” (if IP space is still globally routable,) but in practice it does. Don’t be surprised if other countries follow suit (should they fail to leverage control of current infra via ITU or something.)
Add the name/IP to your local hosts file. It all works great then. Until the server changes IPs, anyways.
I did this with a website I liked which had let the domain expire. It worked for quite some time, until the VPS/whatever expired too. Good thing the Internet Archive is a thing.
The internet gets along quite fine without DNS. Packets route from network to network. DNS is an application-layer protocol. People often confuse the web with the internet. We use phone numbers for phone calls. It's conceivable with IPv6 you could nail up your IP address and use a QR code to make the addresses accessible. In a hundred years will DNS still be necessary? I don't think so.
So here's a weird question: Supposing companies multi-home for DNS, or whatever other essential service, via multiple service providers.
Whatever multi-home means, why can't there just be one service provider that does that? And are we sure that these service providers aren't already doing that as best we might hope for? (For instance, Amazon already has multiple zones, etc.)
I suppose the one thing this can't protect against is some sort of political (broadly defined) threat related to the company itself.
> Whatever multi-home means, why can't there just be one service provider that does that?
Many of these outages are due to pushing broken artifacts or configuration to production.
A single provider can pretty easily offer geographic or network topological redundancy, but administrative and/or technological independence is pretty hard to achieve in a single company.
I mean, I guess what I'm saying is that in theory a single provider could purposely keep two different departments that manage their own artifacts independently.
I believe EasyDNS can automatically push DNS settings to Route53 to host DNS in AWS. Doesn't protect you from fat-fingering a change, but you should be resilient to either EasyDNS or Route53 going down.
Using multiple providers for mostly static DNS is easy, pick one as primary and AXFR to the other and notifications and whatever. Or it's not too hard to keep a zone file in source control and sync it to the providers.
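The "zone file in source control" approach boils down to diffing the desired records against what each provider currently serves. A minimal sketch, assuming records are plain `(name, type, ttl, value)` tuples; fetching and pushing them through each provider's API is left out because those interfaces differ.

```python
def plan_sync(desired, provider_records):
    """Return (to_create, to_delete) so the provider ends up matching the desired zone."""
    desired, current = set(desired), set(provider_records)
    return sorted(desired - current), sorted(current - desired)

# Hypothetical zone contents for illustration only.
zone = [("www.example.com.", "A", 300, "192.0.2.1"),
        ("example.com.", "MX", 3600, "10 mail.example.com.")]
provider_a = [("www.example.com.", "A", 300, "192.0.2.99")]

to_create, to_delete = plan_sync(zone, provider_a)
```

Running the same plan against every provider from the same source file is what keeps them in sync.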
Using multiple providers for fancy DNS, like only providing IPs that pass healthchecks or geotargetting users to datacenters gets pretty hard, because the different providers have similar capabilities, but no uniform interface, so you've either got to do it manually, or you have to build out your own abstraction that is probably limiting.
If possible, insourcing DNS makes the most sense to me, because if you can't keep your service online, it's not the worst if your DNS is offline; and if you can keep your service online, you probably won't mess up your DNS too badly.
Most CDNs offer huge incentives for sending them more traffic; a lot of the time you end up in a contract obligated to handle X requests and Y gigabytes of traffic per month. But personally I believe you should never have a single provider for anything - particularly when it’s acceptable for a company to cut you off with no warning or recourse.
So many sites being reported as down, but change your DNS to something else (e.g. Google 8.8.8.8 and 8.8.4.4) and, after flushing your DNS cache, the sites are available. I was unable to get to ups.com or newegg.com (why yes, I am expecting a new toy), but after switching DNS and flushing DNS cache, I was able to get to both.
Specifically, 1.1.1.1 provided bad addresses (as opposed to no addresses), and removing 1.1.1.1 fixed my problem. By then it had returned a bunch of bad addresses and I had to flush my DNS cache.
I am surprised financial institutions don't have any regulation for redundancy. The one that stuck out to me is the Navy Federal Credit Union website being down. I have not had any issues logging into mobile though for some of the reported sites.
> financial institutions don't have any regulation for redundancy
As CTO of a bank, I wasn’t aware of this. So either we wasted a ton of money and time constantly upgrading redundancy and business continuity technologies to satisfy our regulators… or this statement could be mistaken.
I'm not sure how easy it would be to regulate. But yeah. I've got a few short term trades in my brokerage account, and outages really throw a wrench into those.
because the way downdetector works is it just basically counts how many people are searching/visiting for <site> down and if it's much higher than typical it flags the site as down.
So if everyone searched "is google down" and visited the link on downdetector that was returned in the search, that would add to the downdetector count for that site.
Downdetector doesn't actually know if the site is up or down.
Downdetector only reports an issue if a significant number of users are impacted. To that end, Downdetector calculates a baseline volume of typical problem reports for each service monitored, based on the average number of reports for that given time of day over the last year. Downdetector’s incident detection system compares the current number of problem reports to this baseline and only reports an issue if the current volume significantly exceeds the typical volume of reports.
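The baseline comparison described there can be sketched in a few lines. Downdetector does not say what "significantly exceeds" means, so the `factor` and `min_reports` thresholds below are pure assumptions, as is the function name.

```python
from statistics import mean

def is_incident(current_reports, baseline_samples, factor=3.0, min_reports=10):
    """Flag an incident when current reports are well above the usual volume.

    baseline_samples: report counts for this time of day over the past year.
    factor, min_reports: hypothetical thresholds, not Downdetector's real values.
    """
    baseline = mean(baseline_samples)
    return current_reports >= min_reports and current_reports > factor * baseline
```

This also shows why speculative reports can trip it: the check only sees report volume, not whether the service is actually down.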
Was just browsing a website where the first page of a query worked, but visiting page 2 of the results was returning a DNS error. Was curious how and why only part of the site was down, but it looks like this was the problem as now the whole site is down.
What role does Akamai Edge DNS play in normal internet traffic? DNS responses usually get cached, as far as I understand. And it is usually possible to change your DNS server to e.g. Google's and circumvent the outage. Does Akamai Edge DNS play a role on the server side?
If you use a CDN to front your traffic, you need the CNAME for www (or whatever) to be pointing at their DNS infrastructure, so they can return whichever closest POP is going to serve your traffic.
e.g. dig @1.1.1.1 www.nvidia.com +trace
... various things from the root ...
www.nvidia.com. 7200 IN CNAME www.nvidia.com.edgekey.net.
;; Received 83 bytes from 208.94.148.13#53(ns5.dnsmadeeasy.com) in 35 ms
So the main DNS is fine, but it'll never get an A record because the last link in the chain is toast -- edgekey being Akamai in this case, but all CDNs do this so they can route traffic. Normally, this is a good thing so they can shift traffic within 30 seconds on their side. Unfortunately, it also means it would take nvidia two hours to point away from Akamai.
The trend these days is DNS TTLs of 60 - 300 seconds, to allow "Cloud agility" or something, so sites are exposed to a much larger risk of authoritative nameservers going down.
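To make the "last link in the chain" point concrete, here is a toy model of following that CNAME. The CNAME record is the one from the dig output above; the A record address and the provider labels are made up, and the real chain has more hops than this.

```python
RECORDS = {
    # name: (record type, value, provider hosting that record)
    "www.nvidia.com.": ("CNAME", "www.nvidia.com.edgekey.net.", "dnsmadeeasy"),
    "www.nvidia.com.edgekey.net.": ("A", "192.0.2.10", "akamai"),  # hypothetical address
}

def resolve(name, records, down=frozenset(), max_hops=8):
    """Follow CNAMEs until an A record is found; None if any link cannot be answered."""
    for _ in range(max_hops):
        record = records.get(name)
        if record is None or record[2] in down:
            return None          # this link's DNS provider is not answering
        rtype, value, _provider = record
        if rtype == "A":
            return value
        name = value             # CNAME: restart the lookup at the target name
    return None

healthy = resolve("www.nvidia.com.", RECORDS)
during_outage = resolve("www.nvidia.com.", RECORDS, down={"akamai"})
```

The first hop still answers during the outage, which is why the site's own DNS looks healthy while the name never resolves to an address.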
Services like Akamai use short TTLs for their edge services for a variety of reasons, not least because if one of their edge servers goes offline (for planned or unplanned reasons) it lets them sub in a new one and have it receive traffic immediately, rather than have a bunch of clients continue trying to talk to a dead node. So sure, you can increase those TTLs to trade 'what if the DNS server goes down?' risk with 'what if the edge server goes down?' risk...
But keeping the edge servers up and running is probably a lot harder - they need to scale more to handle traffic load, they have to actually handle client data, TLS termination, much more complex configuration.... so if I'm placing bets on which of those things is more likely to die on me, it's the edge node, not the DNS server.
Well it's been an hour now since I first noticed the effects and their service status still has no useful information or ETA for a fix. It's just an "emerging issue".
Strange thing about the duration of this outage... From logs I have, it seems to have lasted exactly one hour, from 15:38 to 16:38. Their Twitter account also said "disruption lasted up to an hour", though they incorrectly said it started at 15:46 (did it take 8 minutes for their monitoring to notice?).
That makes me think that whatever the fix was, it had to wait for some one-hour cache to expire before it took effect. I'm very interested to find out what the cache issue was, more so than what the original bug was.
Yes, was trying to do the same. Getting this 2nd jab has been a nightmare. Places listed as walk-in with Moderna don't have it, and they ran out of it when I went to get my scheduled jab. Ringing 119 just ends up in a dead line, then this outage. Fun.
With all due respect, having also run auth DNS servers in the 90s, and seen the inside of Akamai’s CDN/DNS setup more recently, it isn’t remotely at the same level of scale or sophistication.
"Scale and sophistication" scale relatively with time. Those servers we ran were relatively at the same level of scale and sophistication for their time. The only differentiator here is uptime, which has gotten worse as time has gone on. Five 9s used to be the standard. Three 9s seems to be the new standard.
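For a sense of what those numbers allow, the arithmetic is simple enough to sketch (the function name is just for illustration, and a 365-day year is assumed):

```python
def downtime_minutes_per_year(nines):
    """Allowed downtime per 365-day year for an availability of N nines."""
    unavailability = 10 ** -nines          # e.g. 5 nines -> 0.00001
    return unavailability * 365 * 24 * 60

five_nines = downtime_minutes_per_year(5)    # a little over 5 minutes a year
three_nines = downtime_minutes_per_year(3)   # a bit under 9 hours a year
```

So a single hour-long outage blows a five 9s budget for roughly a decade, while fitting comfortably inside three 9s.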
DNS is designed to be fault tolerant. Such a design, however, is often not leveraged correctly; the implementation of DNS can be and frequently is subject to SPOFs.
That's even worse if true, despite HNers creating a storm in a teacup over DoSing the blog of a service that doesn't use K8s, when the blog is not their main service. [0]
Either way, the joke is now on the HNers in that thread.
https://soundcloud.com/ryan-flowers-916961339/dns-to-the-tun...