	by stepri 240 days ago
	“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.” It’s always DNS.

10 comments

Nextgrid 240 days ago

I wonder how much of this is "DNS resolution" vs "underlying config/datastore of the DNS server is broken". I'd expect the latter.

babarjaana 240 days ago

Dumb question but what's the difference between the two? If the underlying config is broken then DNS resolution would fail, and that's basically the only way resolution fails, no?

mrktf 240 days ago

My speculation: in the first case DNS simply fails and you can retry later. In the second, you need working DNS to update your DNS servers with the new configuration endpoints where DynamoDB fetches its config (a classic case of circular dependencies; I even managed to hit a similar problem with two small DNS servers...)

vidarh 239 days ago

DNS is trivial to distribute if your backing storage is accessible and/or local to each resolver, so it's a reasonable distinction to make: It suggests someone has preferred consistency at a level where DNS doesn't really provide consistency (due to caching in resolvers along the path) anyway, over a system with fewer failure points.
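To illustrate the caching point, here is a minimal sketch of a resolver that caches answers for a fixed TTL. The class name, the dict standing in for the authoritative store, and the example hostname and addresses are all assumptions made up for illustration; real resolvers are considerably more involved.

```python
from typing import Dict, Tuple


class CachingResolver:
    """Toy resolver that caches answers until their TTL expires."""

    def __init__(self, authoritative: Dict[str, str], ttl_seconds: int) -> None:
        # 'authoritative' is a plain dict standing in for the backing store.
        self._authoritative = authoritative
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[str, int]] = {}  # name -> (address, expires_at)

    def resolve(self, name: str, now: int) -> str:
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            # Still within the TTL: serve the cached answer, even if stale.
            return cached[0]
        address = self._authoritative[name]
        self._cache[name] = (address, now + self._ttl)
        return address


# Hypothetical record; the name and addresses are illustrative only.
store = {"db.example.internal": "10.0.0.1"}
resolver = CachingResolver(store, ttl_seconds=60)

first = resolver.resolve("db.example.internal", now=0)
store["db.example.internal"] = "10.0.0.2"  # the authoritative data changes
stale = resolver.resolve("db.example.internal", now=30)  # old answer, still cached
fresh = resolver.resolve("db.example.internal", now=60)  # TTL expired, refetched
```

However consistent the backing store is, a client behind this resolver keeps seeing the old address until the TTL runs out.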

wdfx 240 days ago

... wonders if the dns config store is in fact dynamodb ...

kjsingh 240 days ago

DNS is managed by Route53, which has no dependency on DynamoDB for its data plane.

ej_campbell 239 days ago

Background on the service: https://aws.amazon.com/builders-library/reliability-and-cons...

CaptainOfCoit 240 days ago

I feel like even Amazon/AWS wouldn't be that dim, they surely have professionals who know how to build somewhat resilient distributed systems when DNS is involved :)

grogers 239 days ago

I doubt a circular dependency is the cause here (probably something even more basic). That being said, I could absolutely see how a circular dependency could accidentally creep in, especially as systems evolve over time.

Systems often start with minimal dependencies, and then over time you add a dependency on X for a limited use case as a convenience. Then over time, since it's already being used it gets added to other use cases until you eventually find out that it's a critical dependency.
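One way to catch this kind of creep is to keep a declared dependency graph and check it for cycles. The sketch below is only an illustration: the service names and both graphs are hypothetical and do not describe any real system.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # 'dep' is already on the current path, so we closed a loop.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle is not None:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle is not None:
                return cycle
    return None


# Hypothetical graphs: service -> services it depends on.
before = {"dns-control-plane": ["config-store"], "config-store": []}
after = {
    "dns-control-plane": ["config-store"],
    "config-store": ["service-discovery"],  # the "convenience" dependency
    "service-discovery": ["dns-control-plane"],
}

cycle_before = find_cycle(before)
cycle_after = find_cycle(after)
```

Here `cycle_before` is `None`, while `cycle_after` walks from `dns-control-plane` through `config-store` and `service-discovery` back to `dns-control-plane`.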

paulddraper 240 days ago

Those aren't really that different.

That's a major way your DNS stops working.

huflungdung 240 days ago

I don’t think it is DNS. The DNS A records were 2h before they announced it was DNS but _after_ reporting it was a DNS issue.

koliber 240 days ago

It's always US-EAST-1 :)

shamil0xff 240 days ago

Might just be BGP dressed as DNS

bayindirh 240 days ago

Even when it's not DNS, it's DNS.

Sometimes it’s BGP.

I don't think that's necessarily true. The outage updates later identified failing network load balancers as the cause; I think DNS was just a symptom of the root cause.

I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo

lkjdsklf 239 days ago

I don’t work for AWS but for a different cloud provider, so this is not a description of this incident, just an example of the kind of thing that can happen.

One particular “dns” issue that caused an outage was actually a bug in software that monitors healthchecks.

It would actively monitor all servers for a particular service (by updating itself based on what was deployed) and update dns based on those checks.

So when the health check monitors failed, servers would get removed from dns within a few milliseconds.

A bug gets deployed to the health check service. All of a sudden users can’t resolve DNS names because everything is marked as unhealthy and removed from DNS.

So not really a “dns” issue, but it looks like one to users
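A toy simulation of that failure mode might look like the following. Everything here (server names, IPs, function names) is made up for illustration and is not any provider's actual system.

```python
from typing import Callable, Dict, List

# Hypothetical fleet: server name -> IP address (illustrative values only).
SERVERS: Dict[str, str] = {
    "web-1": "10.0.0.1",
    "web-2": "10.0.0.2",
    "web-3": "10.0.0.3",
}


def publish_records(servers: Dict[str, str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the IPs that would be published in DNS: only servers passing the check."""
    return sorted(ip for name, ip in servers.items() if is_healthy(name))


def resolve(records: List[str]) -> str:
    """Stand-in for a client lookup: fails when no records are published."""
    if not records:
        raise LookupError("no address records published for this name")
    return records[0]


def working_check(name: str) -> bool:
    # Every server is actually fine in this scenario.
    return True


def buggy_check(name: str) -> bool:
    # Simulated bad deploy: the checker reports everything as unhealthy.
    return False


good = publish_records(SERVERS, working_check)
bad = publish_records(SERVERS, buggy_check)
```

With the working check, `good` holds all three addresses. With the buggy check, `bad` is empty and `resolve(bad)` raises `LookupError`, even though every server is up, which is what the user sees as a "DNS" outage.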

oneeyedpigeon 240 days ago

Downtime Never Stops!

commandersaki 240 days ago

Someone probably failed to lint the zone file.

DrewADesign 240 days ago

DNS strikes me as the kind of solution someone designed thinking “eh, this is good enough for now. We can work out some of the clunkiness when more organizations start using the Internet.” But it just ended up being pretty much the best approach indefinitely.

movpasd 240 days ago

Seems like an example of "worse is better". The worse solution has better survival characteristics (on account of getting actually made).

DrewADesign 240 days ago

I wouldn’t say it’s the worst… a largely decentralized worldwide namespace is not an easy thing to tackle and for the most part it totally works.

ifwinterco 239 days ago

I actually think the design of DNS is really cool. I'm sure we could do better designing from a clean slate today, especially around security (designing with the assumption of an adversarial environment).

But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does

us0r 240 days ago

Or expired domains which I suppose is related?

dexterdog 239 days ago

That's why they wrote the haiku

indycliff 240 days ago

the answer is always DNS
