
by iowahansen 2900 days ago

Grrr. So much for global redundancy.

What is going to be faster: updating DNS records with TTL 3600 to point to a single data center, or Google fixing their problem?

We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?

3 comments

colmmacc 2900 days ago

AWS engineer here, I was lead for Route 53.

We generally use 60 second TTLs, and as low as 10 seconds is very common. There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable. We actually see faster convergence times with DNS failover than using BGP/IP Anycast. That's probably because DNS TTLs decrement concurrently on every resolver with the record, but BGP advertisements have to propagate serially network-by-network.

The way DNS failover works is that the health checks are integrated directly with the Route 53 name servers. In fact every name server is checking the latest healthiness status every single time it gets a query. Those statuses are basically a bitset, being updated /all/ of the time. The system doesn't "care" or "know" how many health statuses change each time; it's not delta-based. That's made it very, very reliable over the years. We use it ourselves for everything.

Of course the downside of low TTLs is more queries, and we charge by the query unless you ALIAS to an ELB, S3, or CloudFront (then the cost of the queries is on us).
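A rough Python sketch of the non-delta design described above. This is not Route 53's implementation; the names `HealthBitset`, `NameServer`, and `resolve` are made up for illustration, and the "fall back to the first record" behaviour is an assumption. The point it tries to show: health checkers overwrite the whole bitset on each update, and the name server reads the current bit on every query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HealthBitset:
    """Health of every endpoint packed into one integer (hypothetical model)."""
    bits: int = 0

    def replace(self, new_bits: int) -> None:
        # Non-delta update: the whole bitset is overwritten each time,
        # so it makes no difference whether 1 or 1,000 statuses changed.
        self.bits = new_bits

    def is_healthy(self, index: int) -> bool:
        return bool((self.bits >> index) & 1)


@dataclass
class NameServer:
    health: HealthBitset
    # name -> [(health-check index, IP)] in priority order
    records: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def resolve(self, name: str) -> Optional[str]:
        candidates = self.records.get(name, [])
        # Health is consulted on every single query, never cached per record.
        for index, ip in candidates:
            if self.health.is_healthy(index):
                return ip
        # Assumption: if nothing is healthy, answer with the first record
        # rather than returning nothing at all.
        return candidates[0][1] if candidates else None
```

With two records for a name and a bitset of `0b11`, `resolve` returns the first IP; after `health.replace(0b10)` the very next query returns the second one, with no per-record change notification involved.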

toast0 2900 days ago

_most_ of the traffic will move in response to DNS changes, but there's always a group of resolvers that keep your old IPs for an unreasonable amount of time. I've taken machines out of DNS rotations with short TTLs (I think 5 minutes, but maybe 1 hour) and had some amount of traffic on them for weeks. After a reasonable amount of time, too bad for them, but when I can work behind a 'real' load balancer it's nice to be able to actually turn off the traffic.

iowahansen 2900 days ago

Interesting, thank you. So a potential mitigation strategy could look like this:

- Route 53 failover record
  * primary record: Google global load balancer IP
  * secondary record: Route 53 Geolocation set (really need that latency)
- Elastic Load balancer record per region
  * routes to mirror region GCP IP address (ELB's application load balancer seems to able to point to AWS external IPs)
  * optionally spin up mirror infrastructure in AWS

Seems brittle. Does Azure support global load balancing with external IPs?

Does anyone have such (or similar) setup actually in production? How did it work today?
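For anyone trying the failover-record part of the layout above, here is a minimal sketch of what the record pair might look like as a Route 53 `ChangeBatch`, built as a plain Python dict. Every name, IP, and ID is a placeholder, the `geo.` prefix for the geolocation set is an invented convention, and the field layout should be checked against the Route 53 API reference before use. No API call is made here.

```python
# All values below are placeholders, not taken from the thread.
ZONE_ID = "Z_EXAMPLE_ZONE"          # hypothetical hosted zone ID
HEALTH_CHECK_ID = "hc-google-glb"   # hypothetical Route 53 health check ID
GOOGLE_GLB_IP = "203.0.113.10"      # documentation-range IP standing in for the Google LB


def failover_change_batch(name: str) -> dict:
    """Build a ChangeBatch holding a PRIMARY and a SECONDARY failover record."""
    primary = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": "primary-google-glb",
        "Failover": "PRIMARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": GOOGLE_GLB_IP}],
        # The primary is only served while this health check passes.
        "HealthCheckId": HEALTH_CHECK_ID,
    }
    secondary = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": "secondary-geo",
        "Failover": "SECONDARY",
        # Alias to the geolocation record set in the same zone.
        # Alias records carry no TTL of their own.
        "AliasTarget": {
            "HostedZoneId": ZONE_ID,
            "DNSName": "geo." + name,
            "EvaluateTargetHealth": True,
        },
    }
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": rrs}
            for rrs in (primary, secondary)
        ]
    }
```

The resulting dict is meant to be passed as the `ChangeBatch` argument of boto3's `change_resource_record_sets`, together with the hosted zone ID.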

manigandham 2900 days ago

That would work, and Azure Traffic Manager does support external IPs. CDNs like Cloudflare and Fastly also have built-in load-balancing where they use their internal routing tables for faster propagation.

fastest963 2900 days ago

I haven't been able to make an ELB target be an external IP. What did you mean by "ELB's application load balancer seems to able to point to AWS external IPs"?

iowahansen 2900 days ago

https://aws.amazon.com/elasticloadbalancing/details/#details

IP addresses as Targets

You can load balance any application hosted in AWS or on-premises using IP addresses of the application backends as targets. This allows load balancing to an application backend hosted on any IP address and any interface on an instance. You can also use IP addresses as targets to load balance applications hosted in on-premises locations (over a Direct Connect or VPN connection), peered VPCs and EC2-Classic (using ClassicLink). The ability to load balance across AWS and on-prem resources helps you migrate-to-cloud, burst-to-cloud or failover-to-cloud.

Looks like you need an active VPN connection to access external IPs.

trout 2900 days ago

That feature requires you to use a private IP address, so if you have a VPN or Direct Connect to another location you could load balance across locations. In the case of the global load balancers those will be public addresses though.

"The IP addresses that you register must be from the subnets of the VPC for the target group, the RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16), and the RFC 6598 range (100.64.0.0/10). You cannot register publicly routable IP addresses."

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/netw...
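A quick way to check a candidate target against the ranges in that quote, using only Python's standard `ipaddress` module. The function name and the `vpc_subnets` parameter are illustrative; the four fixed ranges are the ones quoted above.

```python
import ipaddress

# Ranges quoted from the AWS documentation above.
ALLOWED_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("100.64.0.0/10"),   # RFC 6598
]


def is_registrable_target(ip: str, vpc_subnets: tuple = ()) -> bool:
    """Return True if `ip` falls in an allowed range or one of the VPC subnets."""
    addr = ipaddress.ip_address(ip)
    networks = ALLOWED_RANGES + [ipaddress.ip_network(s) for s in vpc_subnets]
    return any(addr in net for net in networks)
```

A public Google load balancer address fails this check unless it happens to sit inside one of the VPC's own subnets, which is why the ELB-in-front-of-GCP idea does not work without a VPN or Direct Connect.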

orf 2900 days ago

> Of course the downside of low TTLs is more queries

I was diagnosing a networking issue from one of our service providers last Friday. For some indeterminate reason, DNS responses from R53 took upwards of 10-15 seconds to return. While I appreciate that the non-configurable default TTL of 60 seconds for ELB is not plucked out of thin air, and that the actual issue seemed to be on the service provider's side, the lower limit seems far too low for medium/high latency networks. I wish it were configurable.

What's worse is that it looks like it's our site that is the issue, so we get the complaints and I have to dig through Wireshark logs.

colmmacc 2900 days ago

If you have a very high latency network, say a satellite link, make sure that your near-side resolver supports pre-fetching! Unbound is a good choice.
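To illustrate what pre-fetching buys on a slow link, here is a toy Python cache that refreshes an entry when it is queried near the end of its TTL, so the next client does not have to wait on the upstream. It is a sketch, not how Unbound is implemented: the class name, the injectable `clock`, and the synchronous refresh are simplifications, and the 10% threshold is my understanding of Unbound's `prefetch` behaviour rather than something stated in this thread.

```python
import time
from typing import Callable, Dict, Tuple


class PrefetchingCache:
    def __init__(self,
                 lookup: Callable[[str], Tuple[str, int]],
                 clock: Callable[[], float] = time.monotonic,
                 prefetch_fraction: float = 0.1) -> None:
        self._lookup = lookup            # upstream query: name -> (answer, ttl_seconds)
        self._clock = clock
        self._fraction = prefetch_fraction
        self._cache: Dict[str, Tuple[str, int, float]] = {}  # name -> (answer, ttl, expires_at)
        self.prefetches = 0

    def _store(self, name: str) -> str:
        answer, ttl = self._lookup(name)
        self._cache[name] = (answer, ttl, self._clock() + ttl)
        return answer

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._cache.get(name)
        if entry is None or now >= entry[2]:
            # Miss or expired: the client pays the full upstream latency.
            return self._store(name)
        answer, ttl, expires_at = entry
        if expires_at - now <= ttl * self._fraction:
            # Near expiry: refresh now so later queries stay cache hits.
            # A real resolver would do this in the background.
            self.prefetches += 1
            self._store(name)
        return answer
```

With a 60 second TTL, a query at the 55 second mark is answered from cache and also triggers a refresh, so a popular name never expires out from under its clients.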

jniedrauer 2900 days ago

I run unbound on my own workstations. It's so lightweight, you'd never even notice it, but it definitely makes browsing a little more snappy.

AmericanChopper 2900 days ago

>There's a lot of myth out there about upstream DNS resolvers not honoring low TTLs, but we find that it's very reliable

I've done a few unplanned DNS failovers, and I agree with this. What can be real trouble though is if you're running a B2B app, and your customers' corporate networks can be configured in any strange way. I've met real network admins who think they need to have high TTLs everywhere in order to protect themselves from root DNS DDoSes.

nh2 2900 days ago

There really are locations where DNS resolvers don't honor TTL.

For example, the public wifi in the last Hackspace in Munich I visited did not honour my 10 second TTL.

But in my opinion there aren't enough of them to justify not using short TTLs. It's their problem after all if they don't honour websites' settings: Then they will see downtime when nobody else does.

voltagex_ 2900 days ago

Do you mean it was cached for longer than 10 seconds? Was it Freifunk? It might be worth writing to them to ask what their caching setup is.

nodesocket 2900 days ago

I've always thought TTL less than 60 seconds should be avoided, as some upstream DNS resolvers will ignore values less than 60 seconds and use a default long value. You are saying this is not true and a TTL of 10 seconds can safely be used?

colmmacc 2900 days ago

I think it's safe, based on a lot of experiments. We use 5 seconds for S3 ...

    ;; ANSWER SECTION:
    s3.us-east-1.amazonaws.com. 5	IN	A	52.216.165.117

One of the biggest, highest traffic, systems on the internet!

iowahansen 2900 days ago

Traffic is coming back. Looks like Google fixed their load balancer problem within 28 minutes.
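For the numbers in this thread (TTL 3600, outage fixed in about 28 minutes), a back-of-the-envelope model in Python. It assumes resolvers honor the TTL and that their cached copies were fetched at uniformly random times, so the share still holding the old record falls linearly to zero over one TTL. It ignores the time needed to notice the outage and push the change.

```python
def stale_fraction(seconds_since_change: float, ttl: float) -> float:
    """Share of resolvers still caching the old record under the uniform-age assumption."""
    return max(0.0, 1.0 - seconds_since_change / ttl)


outage_seconds = 28 * 60

# TTL 3600: a bit over half of resolvers would still point at the old IP
# by the time Google had recovered.
print(round(stale_fraction(outage_seconds, 3600), 3))  # 0.533

# TTL 60: every resolver has picked up the change within the first minute.
print(stale_fraction(outage_seconds, 60))              # 0.0
```

Under these assumptions, a DNS change with TTL 3600 would not have beaten the fix, while a 60 second TTL would have.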

tango24 2900 days ago

Hmm, this comment says it’s been happening for hours (below). Maybe their status page isn’t accurate.

https://news.ycombinator.com/item?id=17552693

jagthebeetle 2900 days ago

I'd wait for further details from the status page, but as a GCP employee (for whatever that claim's worth on the internet), I'm not seeing evidence of an issue earlier than 12:15 PDT.

partiallypro 2900 days ago

I'm sure it was a cascading event, similar to the one Amazon had yesterday on their own site. Started small until it snowballed and affected everyone.

ti_ranger 2899 days ago

> We host DNS at AWS, but servers in GCP. Should we use AWS's automatic DNS failover feature to cover for such a case?

Well, I would avoid any of GCP's 'Global' features; they are an availability risk.

AWS's approach is rather to have inter-region replication, and there are lots of new features that support this.