by ericbarrett 1957 days ago
The selling point of GKE etc. is “minimal to no maintenance,” but of course somebody else is doing the maintenance and the customer is paying a premium for it. Says great things about Nomad.

Yeah, when making the decision it was quite harrowing to think of maintaining a cluster in production. Nomad had very little operational complexity compared to what we imagined.

We've had two main outages in months:

- Server disks were filling up and we hadn't set up monitoring properly at the time (ironic for the name of our company :) ). Not Nomad's fault.

- A faulty healthcheck caused all the servers of a cluster to restart at the same time, which caused complete loss of the cluster state, so all the jobs were gone. I like to call it collective amnesia of the servers.
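
As an illustration of the kind of check that catches the first failure mode (disks filling up), here is a minimal sketch in Python. It is not the setup described in this thread: the function names `usage_fraction` and `is_nearly_full` and the 90% threshold are assumptions made up for the example, and a real deployment would more likely rely on a monitoring agent than a hand-rolled script.

```python
import shutil


def usage_fraction(used: int, total: int) -> float:
    """Return the used fraction of a disk (0.0 if the total is unknown or zero)."""
    return used / total if total > 0 else 0.0


def is_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """Return True when the filesystem holding `path` is at or above `threshold`.

    `threshold` is a fraction between 0 and 1; 0.9 is an arbitrary example value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage_fraction(usage.used, usage.total) >= threshold


if __name__ == "__main__":
    # Check the filesystem of the current directory and report the result.
    if is_nearly_full("."):
        print("WARNING: disk usage is at or above 90%")
    else:
        print("disk usage is below 90%")
```

Run from cron or a periodic batch job, a check like this only helps if the warning is actually delivered somewhere people look, so the alerting path matters as much as the check itself.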

We're still looking for a good/reliable logging and tracing solution though. Nomad has a great dashboard, but only with basic logging, and it only gets you so far.

Overall, would recommend again!

Jaeger is pretty great for tracing, and can integrate with Traefik/Envoy (or whatever you use for ingress/inter-service communication).

We're running Loki for the logs (via a Nomad log forwarder/shipper and promtail) and so far it's going great. I'll have to do a write-up about the whole thing.

Thank you for the pointers, very helpful. I'd love to see that write up too!
I'd love to see your write-up on the logging thing. Please do!
Would love to see that write-up!