by xref 1816 days ago
That’s a great writeup, thanks for all the detail!

I was always worried about something like this happening, so I only ever provisioned one server at a time (via Ansible). Once the logs showed it was fully synced, we provisioned the next node. It could take two days to add 10 nodes, but I always felt much safer.
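
For anyone wanting to automate that one-at-a-time loop, here is a rough sketch. It is not the setup described above: instead of reading logs, it polls `nodetool status` output and waits for the new node to report `UN` (Up/Normal) before moving on. The `provision` and `fetch_status` callables are hypothetical placeholders for whatever runs your Ansible play and fetches the status text, and hosts are assumed to be listed by the same address `nodetool status` prints.

```python
import re
import time
from typing import Callable, Dict, Iterable

# Node rows in `nodetool status` output start with a two-letter code:
# U(p)/D(own) followed by N(ormal)/L(eaving)/J(oining)/M(oving).
_NODE_ROW = re.compile(r"^([UD][NLJM])\s+(\S+)")


def node_states(status_output: str) -> Dict[str, str]:
    """Map each node address to its two-letter state code."""
    states = {}
    for line in status_output.splitlines():
        match = _NODE_ROW.match(line.strip())
        if match:
            states[match.group(2)] = match.group(1)
    return states


def cluster_ready(status_output: str, new_host: str) -> bool:
    """True when new_host is listed as UN and every other node is UN too."""
    states = node_states(status_output)
    return states.get(new_host) == "UN" and all(
        state == "UN" for state in states.values()
    )


def add_nodes_one_at_a_time(
    hosts: Iterable[str],
    provision: Callable[[str], None],   # hypothetical: runs the Ansible play
    fetch_status: Callable[[], str],    # hypothetical: returns `nodetool status` text
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Provision each host, blocking until it has finished joining."""
    for host in hosts:
        provision(host)
        while not cluster_ready(fetch_status(), host):
            sleep(poll_seconds)
```

Passing the provisioning and status calls in as functions keeps the waiting logic easy to test without a live cluster. A real version would also want a timeout, so a node stuck in `UJ` does not block forever.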

2 comments

On the cloud, it is likely simpler and faster to just spin up a new Cassandra datacenter and then rebuild from the old datacenter to the new one, either on all nodes at once in parallel or in smaller batches. This procedure works fine whether you use static token allocation or vnodes, and adds very little load to the old datacenter, which is still serving traffic.
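
As a rough illustration of the batching part, the sketch below only builds the `nodetool rebuild -- <source-dc>` command for each new node and groups the commands into batches. It does not run anything. The host names, the datacenter name and the use of `nodetool -h` (which assumes remote JMX access is enabled) are assumptions for the example. It also assumes the new nodes were started with bootstrapping disabled and that keyspace replication already includes the new datacenter.

```python
from typing import Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rebuild_command(host: str, source_dc: str) -> List[str]:
    """argv for streaming data to `host` from the existing datacenter."""
    # `-h` targets a remote node over JMX; run nodetool locally on each
    # node instead if remote JMX is not enabled in your setup.
    return ["nodetool", "-h", host, "rebuild", "--", source_dc]


def rebuild_plan(
    new_dc_hosts: Sequence[str], source_dc: str, batch_size: int
) -> List[List[List[str]]]:
    """Group rebuild commands so each batch can be run in parallel."""
    return [
        [rebuild_command(host, source_dc) for host in batch]
        for batch in batches(new_dc_hosts, batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical host and datacenter names.
    plan = rebuild_plan(["new-1", "new-2", "new-3"], "dc_old", batch_size=2)
    for number, batch in enumerate(plan, start=1):
        for argv in batch:
            print(f"batch {number}: {' '.join(argv)}")
```

Setting `batch_size` to the number of new nodes gives the all-at-once variant. Smaller batches trade total time for less streaming load at any one moment.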
This is the standard approach and the one we have detailed runbooks for. We've scaled the cluster fine one node at a time since this experience. It also prompted us to get a much better understanding of all the other flags that have been changed from their defaults.