by vruiz 3994 days ago
> Any LB goes down, and DNS client retries will deal with it.

How? How does the DNS client know that the IP no longer works? Do browsers today have this mechanism?

I'm not a network guy, so perhaps I'm wrong, but my understanding is that the problem with DNS load balancing is that you cannot invalidate the TTL on the client.

It is up to the client. But all of the clients (browsers) out there do more or less the same thing: they try the first DNS record, and if no response in ~30 seconds, try the second, and so on, going down the list.
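
A rough sketch of that "go down the list" behaviour in Python, for anyone who wants to see it spelled out. This is only an illustration, not how any particular browser is implemented: the function name `connect_in_order` and the injectable `connect` parameter are made up for the example, and the 30-second default simply mirrors the figure above.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple


def connect_in_order(
    addresses: Iterable[str],
    port: int,
    timeout: float = 30.0,
    connect: Callable[..., object] = socket.create_connection,
) -> Optional[Tuple[str, object]]:
    """Try each address in the order given.

    Returns (address, connection) for the first address that answers,
    or None if every address fails. `connect` is injectable only so the
    logic can be exercised without a real network (an assumption of
    this sketch).
    """
    for address in addresses:
        try:
            conn = connect((address, port), timeout=timeout)
            return address, conn
        except OSError:
            # Timed out or refused: fall through to the next record.
            continue
    return None
```

With the default `connect`, a dead first address costs up to `timeout` seconds before the second one is tried, which is exactly the delay being argued about in this thread.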

TTL does not matter here because I am not yanking or adding to my DNS record. I am simply saying "Here are 3 servers; try them in order until you find one that works".

In practice, two things help:

a) Most clients try them in order from top to bottom.

b) Most DNS servers (including DigitalOcean's) randomize the return order.

So if you do 2 dns requests, the first will return 1.2.3.4, 1.2.3.5, 1.2.3.6, and the second will return 1.2.3.5, 1.2.3.6, 1.2.3.4

This has the double benefit of splitting traffic more or less evenly between my load balancers, and dealing with things when one or more is dead.
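
To make the "splits traffic more or less evenly" part concrete, here is a small simulation. It assumes the server rotates the list by one position per query, as in the two example answers above; real DNS servers may shuffle randomly instead, so treat the exact numbers as illustrative. The function names are invented for the sketch.

```python
from collections import Counter
from typing import List


def rotated_answers(records: List[str], queries: int) -> List[List[str]]:
    """Return the answer to each query, rotating the record list by one each time."""
    n = len(records)
    return [records[i % n:] + records[:i % n] for i in range(queries)]


def first_record_counts(records: List[str], queries: int) -> Counter:
    """Count how often each record comes first, i.e. which one clients try first."""
    return Counter(answer[0] for answer in rotated_answers(records, queries))


if __name__ == "__main__":
    balancers = ["1.2.3.4", "1.2.3.5", "1.2.3.6"]
    for answer in rotated_answers(balancers, 2):
        print(answer)
    print(first_record_counts(balancers, 300))
```

Over 300 queries each of the three addresses comes first 100 times, so clients that start at the top of the list end up spread evenly across the load balancers.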

I'm not sure all clients will behave as you are experiencing. But in any case:

> if no response in ~30 seconds, try the second

That is not HA. Most people will not wait 30 seconds for a page to load. If your business loses money with every minute of downtime, this is certainly not adequate. It's certainly not recommended: https://en.wikipedia.org/wiki/Round-robin_DNS#Drawbacks

Name a business that has not had a 30 second outage in the past year?

How about services you use a lot? How many hours has Hacker News been down in 2015 (yet you are still on it right now)? How many hours has Netflix been down in 2015? How many hours have entire chunks of AWS been down in 2015?

Every business is on a spectrum. An HFT shop may decide that 1 second of downtime per day is their max outage. A webpage advertising a pet adoption event may decide that 6 hours of downtime per day is the most they can tolerate. You have to make this decision for each product and, even better, for each part of each product.

The entire point of this post thread was the idea behind "I can not use DO for serious stuff until they implement load balancing"... which is silly for most businesses. And even those businesses that need high uptime, I offered (and still believe) that DNS round robin is a decent way to get HA for almost no money.

You link to an article about it, but miss the boat. What other solution can I implement in a few minutes to provide available load balancing between any two servers in the world (same or different host provider, same or different datacenter, same or different continent)?

Sometimes the relatively simple solution is "good enough". Sure, you can find a Wikipedia page saying where it is not perfect. I would not DNS round robin an HFT app. I have no problem with it for 99.99% of the web though. So much of the web has NO failover of any kind that stupid simple DNS round robin would be a vast improvement for most websites.

> Name a business that has not had a 30 second outage in the past year?

It's not a 30 second outage! Your domain will keep resolving to the bad IP. Even with an extremely low TTL (also not recommended), ISPs' DNS resolvers will cache it, and some will even ignore your TTL. A big portion of all new users will keep hitting the bad IP.

Anyway, I won't try to convince you to change your setup if you are happy with it, but it's obvious from the comments that I'm not the only one thinking it's a suboptimal solution, so at least some of us won't be considering DO for HA systems given the circumstances.

> It's not a 30 second outage! Your domain will keep resolving to the bad IP. Even with an extremely low TTL (also not recommended), ISPs' DNS resolvers will cache it, and some will even ignore your TTL. A big portion of all new users will keep hitting the bad IP.

So with 5 load balancers and one of them dead, 1/5 of customers see a one-time hit of 30 seconds (after which they return to full speed).
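
The arithmetic behind that figure, as a back-of-the-envelope sketch. It assumes each client receives the records in a uniformly random order, tries them top to bottom, and waits the full timeout on every dead one; caching resolvers can skew this in practice. The function name and parameters are made up for the example.

```python
from typing import Tuple


def first_try_impact(
    total_balancers: int,
    dead_balancers: int,
    timeout_s: float = 30.0,
) -> Tuple[float, float]:
    """Return (fraction of clients whose first record is dead,
    average delay in seconds over all clients).

    Assumes a uniformly random record order per client (an assumption
    of this sketch, not a guarantee of any DNS server).
    """
    if not 0 <= dead_balancers < total_balancers:
        raise ValueError("need at least one live balancer")
    alive = total_balancers - dead_balancers
    fraction_hit = dead_balancers / total_balancers
    # In a random ordering, the expected number of dead records that
    # come before the first live one is dead / (alive + 1).
    average_delay = timeout_s * dead_balancers / (alive + 1)
    return fraction_hit, average_delay


if __name__ == "__main__":
    print(first_try_impact(5, 1))
```

For 5 balancers with 1 dead, a fifth of clients start on the dead address, and averaged over everyone that works out to about 6 seconds of added delay per client on the first connection.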

What better solution for the same price do you propose to get HA on a budget cloud provider?

> What better solution for the same price do you propose to get HA on a budget cloud provider?

Nothing. Your solution is obviously better than having none, and it's enough for your needs. But the original discussion was about what's needed for DO to become a competitor for big business, not low budget; there they are already king.

Normally you want at least IP failover, meaning that you get an IP that can be rerouted to a different server with a simple API call. At work we use Hetzner, which is not exactly a high-end provider but offers it: http://wiki.hetzner.de/index.php/Failover/en

Then, it can be even better if the provider offers this HA load balancer as a service, so you don't have to set up anything.

You might still need DNS failover to recover from a full datacenter going offline.

Your view is an accurate view. It takes the end user -- be it some sort of client, browser, or manual user retry -- to hit the other, alive IP(s). There's also the TTL of a bad record being dropped to consider.

You can simulate IP failover with something like Elastic Network Interfaces / Elastic IPs in AWS... it's just not going to be on the same level of speed as doing it in, say, your own rack in a datacenter. It's also subject to weirdness where you could have some sort of split brain, with nodes trying to take over interfaces in a loop. The health-checked "multiple load balancers behind a single DNS record" approach has flaws but also simplifies a lot of things.
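
For what it's worth, the health-check half of that approach can be sketched in a few lines of Python. This only works out which addresses should stay in the record; actually publishing the result depends on your DNS provider's API and is deliberately left out. The names `tcp_probe` and `healthy_records`, the port, and the 2-second probe timeout are all assumptions for illustration.

```python
import socket
from typing import Callable, List


def tcp_probe(address: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_records(
    balancers: List[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> List[str]:
    """Return the balancers that currently pass the probe, in their original order."""
    alive = [ip for ip in balancers if probe(ip)]
    # If every probe fails, the checker itself may be the one that is
    # cut off, so keep the full list rather than publish an empty record.
    return alive if alive else list(balancers)
```

Even with a checker like this, a removed address keeps being served from caches until its TTL runs out, so client-side retries still matter.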