by bobfunk 1913 days ago
Netlify CEO here. I'll try to answer the questions from the thread so far:

Some of our customers are affected by an outage of Google's Load Balancer.

These customers are not taking advantage of our DNS management, or they are not using a DNS provider that supports CNAME flattening and are using their root domain name for their website (i.e., no www prefix).

While we don't recommend the setup, we do provide a single IP address that customers who want it can point an A record at.

In general we run our edge infrastructure as a large multicloud setup spanning several different network providers, and offer two separate networks, one for free/self-serve customers that will get newer features faster and one for enterprise customers running mission critical projects where we guarantee very high uptime and reliability through formal SLAs.

The single IP mentioned above however corresponds to a Google Load Balancer, and they are unfortunately currently having an outage for all load balancers in the relevant region. Read more on https://status.cloud.google.com/

Again, while we generally don't recommend using the A record setup for anything mission critical, we are currently doing everything we possibly can to help enterprise customers that have chosen this setup change their configuration.

Really sorry for all the trouble this is causing for our users; a full RCA will be forthcoming.

8 comments

> These customers are not taking advantage of our DNS management

I think I understand the point you are trying to make, that customers who are utilizing Netlify DNS Management are unaffected because reasons, but this is phrased in a way that implies that it is your users' fault for this downtime because they didn't choose to use your related service.

Full RCA with the steps the team has taken to improve this setup will be coming soon. The main issue with AWS's DNS solution, in this context, is that they don't support ALIAS records or similar techniques (CNAME flattening, etc.) for A records pointing to any external provider. That limits our options a lot in terms of what we can do, since anyone using this setup needs to point all their traffic to one or more fixed IP addresses.

Our current solution for the free/self-serve tier of Netlify has been to rely on Google's load balancer product to give people a stable IP pointing to a highly available solution. In light of recent issues, our team has set up a new permanent IP for A records (75.2.60.5) backed by a different solution, but due to the way DNS providers with no ALIAS record support work, it does require our customers to manually change their A records.

I totally get that moving DNS providers is a big deal and we want to give the best experience we can regardless of what provider you're on, but we have to work within the technical limitations of those providers, and it's the nature of things that we do have more options to deliver a completely seamless experience when we operate both the DNS and the edge layer for customers.
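
To make the ALIAS / CNAME flattening limitation described above concrete, here is a minimal sketch of how a flattening DNS provider answers an apex query compared with a pinned A record. The zone data, the `ALIAS` pseudo-type, and every name and address are hypothetical (documentation ranges), not Netlify's or Route 53's actual configuration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory records: name -> list of (record type, value).
# All names and addresses are illustrative only.
RECORDS: Dict[str, List[Tuple[str, str]]] = {
    "example.com": [("ALIAS", "lb.host.example")],   # apex, flattened by the provider
    "pinned.example": [("A", "198.51.100.7")],       # apex pinned to one fixed IP
    "lb.host.example": [("A", "192.0.2.10"), ("A", "192.0.2.11")],
}


def answer_a_query(name: str, max_depth: int = 8) -> List[str]:
    """Return the IPv4 addresses the provider would hand out for `name`.

    ALIAS targets are followed on the provider side, so the client only
    ever sees plain A records (this is the "flattening").
    """
    for _ in range(max_depth):
        records = RECORDS.get(name, [])
        addresses = [value for rtype, value in records if rtype == "A"]
        if addresses:
            return addresses
        targets = [value for rtype, value in records if rtype == "ALIAS"]
        if not targets:
            return []
        name = targets[0]  # follow the alias one hop and try again
    raise RuntimeError("alias chain too long")
```

If the host later changes the records behind `lb.host.example`, the flattened apex follows automatically, while `pinned.example` keeps returning its old address until its owner edits the A record by hand.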

Route 53 General Manager here. Flattening of external provider CNAMEs has a number of availability and accuracy risks. Route 53 offers a 100% availability SLA, and we really mean it. We’ve heard over and over from customers that reliability is our most valuable feature. We can’t provide that same reliability when external queries are in the mix; if we query asynchronously then features such as geo-based routing don’t work as expected for customers. If we query synchronously, then latency and availability are impacted directly.

We do offer ALIAS records between Route 53 hosted zones, and this capability is open to providers such as Netlify. We’d be happy to have customers ALIAS to a hosted zone managed and updated by Netlify. It sounds like your IP addresses are relatively stable, keeping these in sync doesn’t sound like it would be a big deal, and would give you a lever you could pull to change your customer DNS quickly in an event such as this. You could also configure health checks on your own DNS records, which any customer ALIAS records that point to your DNS records in Route 53 would inherit.

If you’re interested in going this route, please contact me at alecpete <at> amazon <dot> com.

If each Route 53 POP is already close to the querying DNS client, then things like geo routing with cached answers might just work well enough in most cases? With each POP having its own cache.

Auto-refreshing the popular records in the background before the TTL expires to help smooth over any temporary issues?

Other big name DNS providers have ALIAS type records. I imagine according to the SLA, AWS Route 53 is still "available", even if it can't resolve a "target address record" (as the ANAME draft calls them) but Route 53 is still able to respond.
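
A rough sketch of the "refresh before the TTL expires" idea from this comment, assuming a provider that caches flattened answers and keeps serving the last known answer when the upstream lookup fails. The class and parameter names are hypothetical; `resolve` stands in for whatever upstream query a real provider would make, and `refresh_due` is meant to be called from a background timer.

```python
import time
from typing import Callable, Dict, List, Tuple


class RefreshAheadCache:
    """Cache flattened answers and refresh them shortly before expiry."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        ttl: float,
        refresh_margin: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl
        self._margin = refresh_margin
        self._clock = clock
        # name -> (addresses, expiry timestamp)
        self._entries: Dict[str, Tuple[List[str], float]] = {}

    def lookup(self, name: str) -> List[str]:
        """Answer from cache; only a first-time lookup queries upstream."""
        cached = self._entries.get(name)
        if cached is not None:
            return cached[0]
        addresses = self._resolve(name)
        self._entries[name] = (addresses, self._clock() + self._ttl)
        return addresses

    def refresh_due(self) -> None:
        """Re-resolve entries that are within `refresh_margin` of expiry.

        On upstream failure the previous answer is kept (served stale),
        which smooths over short outages at the cost of accuracy.
        """
        now = self._clock()
        for name, (_, expiry) in list(self._entries.items()):
            if now < expiry - self._margin:
                continue  # still comfortably fresh
            try:
                addresses = self._resolve(name)
            except OSError:
                continue  # keep the last known answer
            self._entries[name] = (addresses, now + self._ttl)
```

This keeps upstream queries off the request path, which addresses the latency concern raised above, but each cache returns one answer per name, so per-client geo routing by the upstream provider would still be lost.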

Phrasing can always be better but the point is that there's a way to map your DNS to Netlify which is risky and Netlify hasn't made the aggressive decision of blocking it. They outline in their docs all the reasons why you shouldn't do it, provide instructions for how to avoid it and also offer (but do not require) a hosted DNS setup which avoids this pitfall by design.

Some folks still choose to use this way, some have no other choice for various reasons and some don't care/comprehend the potential pitfalls. I do believe most users avoid using a root domain name for their website.

> I do believe most users avoid using a root domain name for their website.

This is where you're definitely wrong.

I could be. Are you saying this based on data or intuition?
As someone who is a little clueless about network infrastructure: if I own "dwrodri.com", and I'm not running a bunch of other services which need to point to this domain, is there any reason why I wouldn't have my root domain pointed to my personal website?

I would personally imagine that any individual or SOHO business hosting their website on GitHub/GitLab would just buy "MomAndPopShop.com" and point it there. I guess I don't know off the top of my head how many of those sorts of places on the web still exist...

The problem is not that they're pointing their apex domain to a personal website; the problem is that they have a CNAME record in place for their apex domain, which is not actually allowed per the DNS standards.
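
The rule behind this comes from RFC 1034: a name that has a CNAME record should not carry any other data, and a zone apex always carries at least SOA and NS records. A minimal sketch of a check for that conflict follows; the zone layout and names are hypothetical, and the sketch ignores the DNSSEC record types that are permitted alongside a CNAME.

```python
from typing import Dict, List, Set


def cname_conflicts(zone: Dict[str, Set[str]]) -> List[str]:
    """Return names that hold a CNAME together with any other record type.

    `zone` maps each name to the set of record types present at that name.
    Simplification: DNSSEC types such as RRSIG and NSEC are not special-cased.
    """
    return sorted(
        name
        for name, types in zone.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone: the apex must have SOA and NS, so a CNAME there conflicts.
example_zone: Dict[str, Set[str]] = {
    "example.com": {"SOA", "NS", "CNAME"},
    "www.example.com": {"CNAME"},
    "mail.example.com": {"A", "MX"},
}
```

Here `cname_conflicts(example_zone)` flags only the apex, which is why providers offer ALIAS-style records or flattening for root domains while a plain CNAME on `www` is fine.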
Sadly, even after switching to their DNS I am still affected.
This should not be the case; if you'd like, Netlify's Support team will be happy to review your settings to help discover why it didn't help you out (start from https://netlify.com/support) and ensure that you are "futureproofed"!
I can heartily recommend contacting _fool for support at Netlify. Always an absolute pleasure.
I switched to using your DNS to resolve this issue, but https://js.la is still busted and because I'm using your DNS, I can't manually set the A record to go to the workaround IP address.
Hi Bob, just want to say, I like your service a lot.
Thanks! Appreciate the kind words!
Seconded, I use it for all my static hosting. Great service.
"These customers are not taking advantage of our DNS management"

You're right. I'm using Cloudflare's DNS. I trust them more than I trust Netlify and that's just a function of their size vs Netlify's size. This response needed better wording.

Cloudflare DNS supports CNAME flattening and you won't be needing the fixed IP address if setting up DNS with them.
More details for folks who are curious about optimal config using Cloudflare's DNS hosting, can be found here: https://answers.netlify.com/t/support-guide-which-are-some-g...
Depending on your config, another DNS related issue with Netlify is the way NS1.com (their vendor) handles domain names. A domain can only be added to one NS1 account. So if Netlify adds to their account internally, you can't use NS1 and vice versa.
Are you all having shake ups within the company? I'm not going to deep dive, but I heard some rumors about some higher ups leaving.

After the Cloudflare Pages release, I'd be curious of what your future road map looks like and how you all plan to compete and grow.

Thanks for all you and your team do. What you have done for front-end development and the community has been nothing but awesome and inspiring.

Honestly, "not taking advantage of our DNS management" is a garbage response. We use AWS for our DNS management. If you offer a configuration, you should support it fully.

Our sites have been down for 3 hours now, and you're blaming someone else? We have 5 properties on Netlify now and will have 0 this time next week.

> Our sites have been down for 3 hours now, and you're blaming someone else?

Well if the issue is at Google then maybe "blaming" isn't really the right word. No need to be rude.

I might as well make the same argument for your sites.

- Your sites have been down for 3 hours now, and you're blaming someone else?

Yes, it is our fault for believing Netlify had contingency plans as hosting is their core business. We're fixing this mistake now so that our customers don't have the same experience.
By the same line of reasoning, your customers could be faulted for believing you had a contingency plan.
Nobody is telling parent's customers how to feel. But the OP suggests that Netlify customers should be faulted for choosing the wrong setup. Broken trust goes all the way down the chain, which is why the middle links have every reason to get ticked off.
The difference is that Netlify communicated the risks to its customers, something other parts of the chain apparently did not do, in addition to not evaluating the risks presented to them by Netlify.
Point your apex domain to 75.2.60.5, Netlify recommends it here [0] and in their documentation now [1].

I just did for a site that's hosted by Netlify and it solved the issue. Thankfully I had a short TTL, I hope you do too.

[0] https://www.netlifystatus.com/

[1] https://docs.netlify.com/domains-https/custom-domains/config...
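
For anyone wanting to confirm the change took effect, a small sketch that checks whether an apex domain now resolves only to the address quoted above. The function names are hypothetical, `resolve_ipv4` goes through the local system resolver (so it needs network access), and cached answers can linger for up to the old record's TTL.

```python
import socket
from typing import Iterable, List

NEW_APEX_IP = "75.2.60.5"  # address quoted in the comment above


def resolve_ipv4(hostname: str) -> List[str]:
    """Look up IPv4 addresses via the system resolver (requires network)."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def apex_is_updated(addresses: Iterable[str], expected: str = NEW_APEX_IP) -> bool:
    """True only if at least one address came back and all of them match `expected`."""
    found = list(addresses)
    return bool(found) and all(address == expected for address in found)
```

Usage would look like `apex_is_updated(resolve_ipv4("example.com"))`, with your own apex domain in place of the placeholder. A `False` result shortly after the change may only mean that a resolver is still serving the old record from cache.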

I'm not sure about your organization's setup with Netlify, but isn't the whole point of Serverless to be... "serverless"? I could migrate twice the number of properties you have to another provider in less than 3 hours...

I get your frustration, but maybe cut them some slack. If anything is mission critical, you should have had a backup plan in case Netlify, Vercel, Cloudflare, or something else goes down.

We use(d) Netlify for the frontend. I agree, our mistake was believing Netlify could be used for more than toy websites and took care of backup plans for us. Clearly they do not.
I do believe you to be trolling now by saying that. If not, congrats on the valuable lesson!
Not trolling, just very frustrated. But yes a valuable lesson.
What's keeping you from migrating your frontends? Shouldn't that take a couple of hours at worst?
This could've been avoided with an HTTP LB, vs a L4 one...