Hacker News
by mdaniel 1031 days ago
> our nomad cluster would go down without any way to recover,

For my curiosity, was it Nomad or Consul that fell over? My experience with etcd leads me to suspect it was actually a Consul fire, but since I (thankfully) have never run Nomad I don't know firsthand about its dragons.

1 comment

I’m trying to remember. I think it was Nomad.

Essentially, about once a week the Raft pings between the nodes would get no response from the primary, so the cluster would assume it had lost the node and try to hold an election to pick a new leader. Then it would get stuck in a loop because it kept indefinitely trying to ask the unresponsive leader for its vote.

I thought, surely this software isn’t that stupid. But it was.

The documented recovery procedure was to hand-generate a peers.json file. Seriously? In 2013 when you have a million ways to do auto discovery? Including in your own software? It couldn’t just auto-heal?
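
For anyone who hasn't had to do it, here is a rough sketch of what "hand-generating" that file amounts to. It assumes the newer Raft protocol layout, a JSON array of objects with `id`, `address`, and `non_voter` fields. Check the outage recovery docs for your version before trusting that. The node IDs and addresses are made-up placeholders, and `build_peers` is a hypothetical helper, not anything Nomad ships.

```python
import json


def build_peers(servers):
    """Build the peers list from (node_id, rpc_address) pairs.

    Assumed layout: one object per server with "id", "address", "non_voter".
    """
    return [
        {"id": node_id, "address": address, "non_voter": False}
        for node_id, address in servers
    ]


# Placeholder values: substitute the real node IDs and RPC addresses
# of the servers that should form the recovered cluster.
servers = [
    ("node-id-1", "10.0.0.1:4647"),
    ("node-id-2", "10.0.0.2:4647"),
    ("node-id-3", "10.0.0.3:4647"),
]

peers_json = json.dumps(build_peers(servers), indent=2)
print(peers_json)
```

You would then have to get that output onto every server by hand, which is exactly the kind of thing discovery is supposed to make unnecessary.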

I managed a MongoDB cluster ten years ago and never had a single issue like this. I could routinely take nodes down and bring them up and the cluster healed perfectly.