Hacker News new | ask | show | jobs
by kmuthukk 2642 days ago
hi @mpercy,

Thanks for correcting me on the dynamic consensus membership change. Looks like the basic support was indeed there, but several important enhancements were needed (for correctness and usability).

- To make the "online" piece of the membership change work correctly we added support for LEARNER (PRE VOTER) role (where the new member enters in a non-voting mode till it's caught up). https://github.com/YugaByte/yugabyte-db/commit/909d26e31ecd0....

- Load Balancing (which uses the membership changes) is automatic. (https://github.com/YugaByte/yugabyte-db/commit/e4667eb7ec0e6...)

- Remote bootstrap (due to membership changes) also has undergone substantial changes given that YugaByte DB uses a customize/extended version of RocksDB as the storage engine and does a tighter coupling of Raft with RocksDB storage engine. (https://github.com/YugaByte/yugabyte-db/blob/master/docs/ext...)

- Dynamic Leader Balancing is also new-- it causes leadership to be proactively altered in a running system to ensure each node is the leader for a similar number of tablets.

regards, Kannan

1 comments

Interesting. Just last year we implemented improved re-replication (https://github.com/apache/kudu/commit/79a255) which sounds very similar to what you did with LEARNER roles, and we added manually-triggered rebalancing (https://github.com/apache/kudu/commit/ccdcf6 and https://kudu.apache.org/releases/1.8.0/docs/administration.h...).

I'm curious if you did anything to prevent automatic rebalancing from being triggered at a "bad time" or have throttled it in some way, or whether moving large amounts of data between servers at arbitrary times was not a concern.

I am also curious if you added some type of API using the LEARNER role to support a CDC-type of listener interface using consensus.

By the way, we also recently added support for rack/location awareness in a series of patches including https://github.com/apache/kudu/commit/ebb285

We should really start some threads on the dev lists to periodically share this type of information and merge things back and forth to avoid duplicating work where possible. I know the systems are pretty different at the catalog and storage layers but there are still many similarities.