Hacker News
by engine_y 777 days ago
We've been using Traefik in prod for 2 years. While I used NGINX in the past, I decided to migrate to Traefik mainly because of the automatic Let's Encrypt integration. I am sorry for that decision. Traefik's documentation does not make sense to me or my team. It is finicky and misbehaves without proper logging. For example, when I want to recreate the certificates, it fails sporadically, leaving prod down for an indefinite amount of time.

We're moving back to NGINX.

9 comments

This is one area where I've found NixOS to be really helpful. I can set this up by just adding some lines to configuration.nix (which uses lego [1] and Let's Encrypt in the backend):

  security.acme = {
    acceptTerms = true;
    defaults.email = "admin-email@provider.net";
    certs."mydomain.example.com" = {
      domain = "*.mydomain.example.com";
      dnsProvider = "cloudflare";
      environmentFile = "/path/to/cloudflare/password";
    };
  };
  
  services.caddy.enable = true;
  
  services.caddy.virtualHosts."subdomain1.mydomain.example.com" = {
    extraConfig = ''
      reverse_proxy 127.0.0.1:1234
    '';
    useACMEHost = "mydomain.example.com";
  };

Configuring this with nginx is fairly similar, I think.

[1] https://github.com/go-acme/lego

Nice, I am about to look into wildcard certs with NixOS. Looks like it all 'just works' as long as you use a supported DNS provider?
Kudos to Nix again!
I have moved to Traefik from NGINX as well because of the built-in support for DNS challenge and wildcard cert. I myself spent many hours trying to get it working for my domain I use at work. I used the same config I use at home (which works perfectly) but could never get it to actually do anything, even though the setup was identical: same domain registrar with the same API, based on the same Docker configs, etc. I had all logs enabled and still got no information whatsoever about why my certificate could not be created. It simply fell back to its generated cert, seemingly without trying. After two troubleshooting sessions and several hours of searching, I had to admit defeat and just use my own self-signed cert files. It is very frustrating when you get no information about why it doesn't work. Just a silent failure and fallback.

Overall, that has been my biggest problem with Traefik. It's awesome when it works, but when it does not, I always seem to have problems troubleshooting and/or finding the information I need in the docs.

At work we will start using Traefik in prod towards the end of the year. I hope Traefik and I will become better friends before that :)

> I have moved to Traefik from NGINX as well because of the built-in support for DNS challenge and wildcard cert. I myself spent many hours trying to get it working for my domain I use at work.

Certbot has plugins that directly support many DNS providers, and can automate configuration of Nginx. Using, for example, the Cloudflare plugin for DNS validation combined with the Nginx plugin for local config would solve your problem readily.

I’ve always just used Go’s built-in reverse proxy if I need an API gateway. You can adapt it to meet any specific need, easily find libraries to do common tasks (CORS, rate limiting, retries, etc.), and the best part: no configuration language. You just write Go.
I did the same thing. After some bad downtime from Traefik introducing breaking changes in a point release, I decided to write my own.

My reverse proxy offered a service mesh, live config reloads, managed TLS certs, and automatically rerouted traffic around down services. The whole thing was a few hundred LOC anyone could understand in its entirety. It ran in production for years unchanged and never caused an outage.

Curious what the performance characteristics are here? I would assume something like Nginx, which has been optimized over a longer period of time for a more specific use case, would have non-negligible performance benefits at scale?
Not everything needs to be at “scale”. I’ve deployed this pattern at over 10k req/sec, but it’s all about your SLOs. I’ve (thankfully) never needed to lose sleep over a millisecond or two in my line of work.
> I decided to migrate to Traefik mainly because of the automatic Let's Encrypt integration.

You probably already know, and maybe it didn't work for you, but there are quite a few Docker companion containers that automate Let's Encrypt certs for an nginx Docker container.

I had many of the same issues early on in my Traefik experience. Things like using TLS-01 validation without having DNS records set before the config was applied caused a lot of frustration. Like you, I was frustrated with how little logging I was getting. I eventually learned that without DNS configured appropriately, validation would fail, and after N unsuccessful attempts LE would refuse to do another TLS-01 validation for a while, which sounds like the kind of issue you were having.

After moving to DNS-01 validation, which comes with the added benefit of letting me cut certs for services that aren't publicly exposed with way less orchestration than TLS-01 style validation requires, my experience was suddenly much better. Assuming the DNS provider is working (and if it's not, you're hopefully getting an API error from them before LE attempts to validate the record), the failure state happens well before any check-failure backoffs kick in at LE. At this point, regardless of whether I'm using Traefik, Caddy, Nginx, or any other reverse proxy, I'm pretty committed to only using DNS-01 based validation from Let's Encrypt from now on, or, if I have to do TLS-01 based validation, to making darn sure things are right with the Staging API first.

Speaking of which, if you cut a Staging cert with LE via Traefik, there's no good way to invalidate it. You have to munge the ACME JSON to remove the cert and restart Traefik (could maybe do a SIGHUP? didn't try) to get it to pick up the changes.

All said, there are lots of weird silent failures and behaviors, but the biggest pain is that errors from dependent services are made opaque.

I use NGINX and Traefik in prod at work, and for my personal stuff I only use NGINX. It's all just orchestrated containers, no ingress controllers or similar magic anywhere.

I agree with your comments about Traefik being finicky, and would like to add that my very basic in-house solution for automatic Let's Encrypt integration (which also works with other ACME-compatible CAs) is ~30 lines of bash, run by cron every day. It's rock solid simply because it fails hard whenever a command returns a non-zero exit code. Monitoring for failed certificate renewals is as easy as handshaking with the endpoint and parsing the NotAfter field in the OpenSSL output. I run this as part of my regular HTTP endpoint monitoring solution, and it tells me if any certificate will expire within 14 days.

The absolute worst failures I've experienced were new domains starting out with a self-signed certificate until I reloaded nginx manually, and having 2 weeks to jump in and sort out some error because a certificate renewal failed.

So at least in my experience, it turns out that LE integration isn't a strong selling point. Logging and ease of configuration are. NGINX is not perfect in those respects either, but it is at least a bit more robust and better documented.

You might be happy to know that integration between Let's Encrypt and Nginx is something that's been provided by Certbot for years. The Nginx plugin for Certbot will identify active domains from your Nginx config, create and renew certificates, automating domain validation through the web server in real-time, and will automatically update your config files with both certificate paths and HTTP redirects to HTTPS (if desired).
Which is what I used for years, but recently discovered that Certbot now requires snapd to be installed. I did that and snapd bricked my server: it wouldn't start until I uninstalled it. That's when I switched to Caddy.
That's very definitely not true. Perhaps they're defaulting to Snap for convenience, but Certbot is a cross-platform Python program, and can just be installed via pip: https://certbot.eff.org/instructions?ws=nginx&os=pip

Non-Ubuntu distros also often have standard packages in their repos with no reference to Snap, and EFF also distributes a Docker container with Certbot pre-configured, if Docker is your thing.

I wasn't aware of that. It was true for my version of Ubuntu (18), according to the website: https://certbot.eff.org/instructions?ws=nginx&os=ubuntubioni...

Perhaps I had other options the website didn't make me aware of, but it seemed like enough of a hassle that I just dropped it.

Ubuntu is the one forcing the use of Snaps, and it's one of the reasons lots of people are abandoning Ubuntu on both server and desktop. You're going to run into this semi-regularly with a variety of software if you continue to use Ubuntu.

FWIW, Certbot is available in the standard repos for almost all other major distros.

Tnx. That's helpful to know.
I've considered Traefik for that too. We had 1800+ domains, so automatic TLS would be useful. But the OSS version didn't have good options for certificate storage imo.

Ended up using nginx and adding a .well-known/certbot endpoint or so that used Lua to call certbot. Some bash, rsync & NFS for config management, never had an issue with it. Not fully automated, but close enough. And very debuggable!

We’ve been using Nginx in prod for 3 years. While I used Traefik in the past, I decided to migrate to Nginx mainly because of its scriptability (Traefik plugins suck). I am sorry for that decision. Nginx’s documentation is absolute trash full of non-explanations (far worse than Traefik or Caddy). It’s finicky and misbehaves constantly. Lurking around every corner is a decision from 1995 sticking around in 2024; Nginx can barely function on the modern internet without _significant_ tuning.

On top of it, the OpenResty community must be the rudest, most entitled people on the entire internet. Have a question? “YOU’RE DOING IT WRONG, IDIOT” is the response. Of course, every terrible decision they’ve made gets justified with “BUT THE PERFORMANCE”, as if that’s the only thing worth considering.

We’re moving back to Traefik, or Caddy, both still in POC.