by omneity 1957 days ago
Yeah, when making the decision it was quite harrowing to think of maintaining a cluster in production. Nomad had very little operational complexity compared to what we imagined.

We've had two main outages in months:

- Server disks were filling up and we hadn't set up monitoring properly at the time (ironic for the name of our company :) ). Not Nomad's fault.

- A faulty healthcheck caused all the servers of a cluster to restart at the same time, which caused complete loss of the cluster state (all the jobs were gone; I like to call it a collective amnesia of the servers).
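
For the disk issue, even a tiny check would have caught it early. Here's a rough sketch of the idea in Python, not what we actually run: the function names and the 85% threshold are made up for illustration, and in practice you'd wire this into whatever alerting you already have.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_disk_nearly_full(path: str = "/", threshold_percent: float = 85.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # Assumed usage: run from cron or a periodic job on each server.
    percent = disk_usage_percent("/")
    status = "ALERT" if is_disk_nearly_full("/") else "ok"
    print(f"{status}: / is {percent:.1f}% full")
```

Run it on a schedule on each server and alert when it prints ALERT.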

We're still looking for a good/reliable logging and tracing solution though. Nomad has a great dashboard, but only with basic logging, and it only gets you so far.

Overall, would recommend again!


Jaeger is pretty great for tracing, and can integrate with Traefik/Envoy (or whatever you use for ingress/inter-service communication).

We're running Loki for the logs (via a Nomad log forwarder/shipper and promtail) and so far it's going great. I'll have to do a write-up about the whole thing.

Thank you for the pointers, very helpful. I'd love to see that write up too!
I'd love to see your write-up on the logging thing. Please do!
Would love to see that write-up!