We observed in past that long GC is the cause of node failures. When long GC happens node doesn’t respond, master node decides that this node had left the cluster :\
Ya, we often see a node die of natural causes, and then the garbage produced from recovering the node and relocating the data ends up bringing down the rest of the cluster via long GC pauses.