“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?
My speculation: in the first case, DNS just fails and you can retry later. In the second, you need working DNS to update your DNS servers with the new configuration endpoints where DynamoDB fetches its config (a classic case of circular dependencies; I even managed to get a similar problem with two small DNS servers...)
DNS is trivial to distribute if your backing storage is accessible and/or local to each resolver, so it's a reasonable distinction to make: It suggests someone has preferred consistency at a level where DNS doesn't really provide consistency (due to caching in resolvers along the path) anyway, over a system with fewer failure points.
I feel like even Amazon/AWS wouldn't be that dim, they surely have professionals who know how to build somewhat resilient distributed systems when DNS is involved :)
I doubt a circular dependency is the cause here (probably something even more basic). That being said, I could absolutely see how a circular dependency could accidentally creep in, especially as systems evolve over time.
Systems often start with minimal dependencies. Then you add a dependency on X for a limited use case as a convenience, and since it's already in use, it gets pulled into other use cases until you eventually find out that it's a critical dependency.
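To make the "creeping circular dependency" idea concrete, here is a minimal sketch that checks a service dependency graph for cycles. The service names (dns_control, config_store, database) are made up for illustration and say nothing about how AWS is actually built.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    unvisited, in_progress, done = 0, 1, 2
    state: Dict[str, int] = {name: unvisited for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = in_progress
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, unvisited) == in_progress:
                # We walked back into something we are still resolving: a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, unvisited) == unvisited:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for name in deps:
        if state[name] == unvisited:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical starting point: a clean chain with no cycle.
deps = {
    "dns_control": ["config_store"],
    "config_store": ["database"],
    "database": [],
}
before_cycle = find_cycle(deps)

# Later, as a convenience, the database starts locating its config via DNS.
deps["database"].append("dns_control")
after_cycle = find_cycle(deps)
```

Here before_cycle is None, while after_cycle walks dns_control, config_store, database and back to dns_control. Each individual edge looked harmless when it was added, which is the point being made above.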
I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause. I think DNS was just a symptom of the root cause.
I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo
I don’t work for AWS but for a different cloud provider, so this is not a description of this incident, just an example of the kind of thing that can happen.
One particular “DNS” issue that caused an outage was actually a bug in the software that monitors health checks.
It would actively monitor all servers for a particular service (keeping its server list up to date based on what was deployed) and update DNS based on those checks.
So when health checks failed, servers would get removed from DNS within a few milliseconds.
A bug gets deployed to the health check service. All of a sudden users can’t resolve DNS names because everything is marked as unhealthy and removed from DNS.
So not really a “DNS” issue, but it looks like one to users.
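A rough sketch of that failure mode, assuming a simplified model where the published DNS records are just "whichever servers pass the check". All function names and addresses here are hypothetical, not any provider's real system.

```python
from typing import Callable, List


def publish_records(servers: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the addresses that stay in DNS: only servers passing the health check."""
    return [s for s in servers if is_healthy(s)]


def working_check(server: str) -> bool:
    # Hypothetical probe: pretend every server except 10.0.0.3 answers.
    return server != "10.0.0.3"


def buggy_check(server: str) -> bool:
    # Hypothetical bad deploy: the checker itself is broken and reports everything as down.
    return False


def resolve(records: List[str]) -> str:
    """Stand-in for a client lookup: fails if the name has no records left."""
    if not records:
        raise LookupError("name has no records")
    return records[0]


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

before = publish_records(servers, working_check)  # the one bad server is dropped
after = publish_records(servers, buggy_check)     # every server is dropped
```

With the working check, only the genuinely bad server leaves the record set. With the buggy check the record set is empty, so resolve(after) raises LookupError even though every server is still up. From the outside that looks exactly like DNS being broken.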
DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.
I actually think the design of DNS is really cool. I'm sure we could do better designing from a clean slate today, especially around security (designing with the assumption of an adversarial environment).
But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does.