The selling point of GKE etc. is “minimal to no maintenance,” but of course somebody else is doing the maintenance and the customer is paying a premium for it. Says great things about Nomad.
Yeah, when making the decision it was quite harrowing to think of maintaining a cluster in production. Nomad had very little operational complexity compared to what we imagined.
We've had two main outages in months:
- Server disks were filling up and we hadn't set up monitoring properly at the time (ironic for the name of our company :) ). Not Nomad's fault.
- A faulty healthcheck caused all the servers of a cluster to restart at the same time, which caused complete loss of the cluster state (so all the jobs were gone. I like to call it a collective amnesia of the servers).
We're still looking for a good/reliable logging and tracing solution though. Nomad has a great dashboard, but only with basic logging, and it only gets you so far.
Jaeger is pretty great for tracing, and can integrate with Traefik/Envoy ( or whatever you use for ingress/inter-service communication).
We're running Loki for the logs ( via nomad log forwared/shipper and promtail) and so far it's going great. I'll have to do a write-up about the the whole thing.
We've had two main outages in months:
- Server disks were filling up and we hadn't set up monitoring properly at the time (ironic for the name of our company :) ). Not Nomad's fault.
- A faulty healthcheck caused all the servers of a cluster to restart at the same time, which caused complete loss of the cluster state (so all the jobs were gone. I like to call it a collective amnesia of the servers).
We're still looking for a good/reliable logging and tracing solution though. Nomad has a great dashboard, but only with basic logging, and it only gets you so far.
Overall, would recommend again!