by stepri 240 days ago
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”

It’s always DNS.

10 comments

I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.
Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?
My speculation: in the first case, DNS just fails and you can retry later. In the second, you need working DNS to update your DNS servers with the new configuration endpoints where DynamoDB fetches its config (a classic case of circular dependencies; I even managed to get a similar problem with two small DNS servers...)
DNS is trivial to distribute if your backing storage is accessible and/or local to each resolver, so it's a reasonable distinction to make: It suggests someone has preferred consistency at a level where DNS doesn't really provide consistency (due to caching in resolvers along the path) anyway, over a system with fewer failure points.
... wonders if the dns config store is in fact dynamodb ...
DNS is managed by Route53, which has no dependency on DynamoDB for its data plane.
I feel like even Amazon/AWS wouldn't be that dim, they surely have professionals who know how to build somewhat resilient distributed systems when DNS is involved :)
I doubt a circular dependency is the cause here (probably something even more basic). That being said, I could absolutely see how a circular dependency could accidentally creep in, especially as systems evolve over time.

Systems often start with minimal dependencies, and then over time you add a dependency on X for a limited use case as a convenience. Then over time, since it's already being used it gets added to other use cases until you eventually find out that it's a critical dependency.
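
To make the "dependency creeps in over time" point concrete, here is a minimal sketch of a check that walks a service dependency graph and reports the first cycle it finds. The service names are hypothetical and are not a description of how AWS is wired; the only idea is that one convenience edge added later can close a loop nobody intended.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            seen = state.get(dep, WHITE)
            if seen == GREY:
                # dep is already on the current path, so we have closed a loop
                return path[path.index(dep):] + [dep]
            if seen == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical services, purely for illustration.
deps = {
    "dns-control-plane": ["config-store"],
    "config-store": ["service-discovery"],
    "service-discovery": [],
}
print(find_cycle(deps))  # no cycle yet

# Later, someone adds a "convenience" lookup through DNS.
deps["service-discovery"].append("dns-control-plane")
print(find_cycle(deps))  # now reports a loop through all three services
```

Running something like this in CI against a declared dependency manifest would only catch the edges people remember to declare, which is arguably the hard part.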

Those aren't really that different.

That's a major way your DNS stops working.

I don’t think it is DNS. The DNS A records were 2h before they announced it was DNS but _after_ reporting it was a DNS issue.
It's always US-EAST-1 :)
Might just be BGP dressed as DNS
Even when it's not DNS, it's DNS.
Sometimes it’s BGP.

/s

I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause--I think DNS was just a symptom of the root cause

I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo

I don’t work for AWS but for a different cloud provider, so this is not a description of this incident, just an example of the kind of thing that can happen.

One particular “dns” issue that caused an outage was actually a bug in software that monitors healthchecks.

It would actively monitor all servers for a particular service (by updating itself based on what was deployed) and update dns based on those checks.

So when the health check monitors failed, servers would get removed from dns within a few milliseconds.

Bug gets deployed to health check service. All of a sudden users can’t resolve dns names because everything is marked as unhealthy and removed from dns.

So not really a “dns” issue, but it looks like one to users
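
A rough sketch of that failure mode, assuming a toy updater that publishes whichever servers pass their health checks. The class, the `min_healthy_fraction` threshold, and the IPs are all made up for illustration; this is not how any particular provider implements it. The guard shown is one common mitigation: if too many servers fail at once, suspect the checker and keep the last published records.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DnsUpdater:
    """Toy model: publishes the set of server IPs that pass health checks."""
    servers: Set[str]
    min_healthy_fraction: float = 0.5          # hypothetical safety threshold
    published: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Start with every known server in DNS.
        self.published = set(self.servers)

    def apply(self, check_results: Dict[str, bool], fail_safe: bool = True) -> Set[str]:
        # Servers missing from the results are treated as unhealthy.
        healthy = {ip for ip in self.servers if check_results.get(ip, False)}
        if fail_safe and len(healthy) < self.min_healthy_fraction * len(self.servers):
            # Too many failures at once: more likely a broken checker than a
            # real mass outage, so keep the last published records.
            return self.published
        self.published = healthy
        return self.published


servers = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
buggy_results = {ip: False for ip in servers}   # checker bug marks everything unhealthy

naive = DnsUpdater(servers)
print(naive.apply(buggy_results, fail_safe=False))   # nothing left to resolve

guarded = DnsUpdater(servers)
print(guarded.apply(buggy_results))                  # previous records stay published
```

The trade-off is that the guarded version keeps serving dead IPs during a genuine mass failure, so the threshold is a judgment call rather than a fix.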

Downtime Never Stops!
Someone probably failed to lint the zone file.
DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.
Seems like an example of "worse is better". The worse solution has better survival characteristics (on account of getting actually made).
I wouldn’t say it’s the worst… a largely decentralized worldwide namespace is not an easy thing to tackle and for the most part it totally works.
I actually think the design of DNS is really cool. I'm sure we could do better designing from a clean slate today, especially around security (designing with the assumption of an adversarial environment).

But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does

Or expired domains which I suppose is related?
That's why they wrote the haiku
the answer is always DNS