Hacker News
by lclarkmichalek 3144 days ago
Cool, I've a pretty big cluster with some GC issues (p90 - 15s, p99 - 60s) during node failures, and would be super interested in those results! If there's anything a user can do to help, my email is on my user page :D

We have observed in the past that long GC pauses are a cause of node failures. When a long GC happens, the node doesn't respond, and the master node decides that the node has left the cluster :\
Ya, we often see a node die of natural causes, and then the garbage produced from recovering the node and relocating the data ends up bringing down the rest of the cluster via long GC pauses.
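
For anyone following along, here is a rough sketch of the failure mode described in this thread: a node stuck in a stop-the-world pause misses the master's pings and gets dropped. Everything in it is an assumption for illustration (the `Pause` type, the function names, the 1s ping interval, and the 3-miss threshold are all made up), and it is not the fault-detection code of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Pause:
    """A hypothetical stop-the-world GC pause on one node."""
    start: float     # seconds since the start of the simulation
    duration: float  # seconds the node is unresponsive


def node_responds(t, pauses):
    """The node answers a ping at time t unless it is inside a pause."""
    return not any(p.start <= t < p.start + p.duration for p in pauses)


def first_eviction_time(pauses, ping_interval=1.0, retries=3, horizon=120.0):
    """Return the time at which the master drops the node, or None.

    Assumed policy: the master pings every `ping_interval` seconds and
    drops the node after `retries` consecutive missed pings.
    """
    missed = 0
    t = 0.0
    while t <= horizon:
        if node_responds(t, pauses):
            missed = 0  # any successful ping resets the counter
        else:
            missed += 1
            if missed >= retries:
                return t
        t += ping_interval
    return None


if __name__ == "__main__":
    # A 15s pause (the p90 figure quoted above) outlasts three pings.
    print(first_eviction_time([Pause(start=10.0, duration=15.0)]))
    # A 2s pause only misses two pings, so the node stays in the cluster.
    print(first_eviction_time([Pause(start=10.0, duration=2.0)]))
```

With these made-up settings, the 15s pause gets the node dropped at t=12.0 while the 2s pause is survived. Raising the retry count or interval would tolerate longer pauses, but at the cost of slower detection of nodes that are really dead.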