Hacker News new | ask | show | jobs
by danburkert 2638 days ago
Does YugaByte still use the Raft and HybridTime implementations from Apache Kudu? If so, how relevant are these results for Kudu?
2 comments

I wanted to add a few details to the previous reply.

While the Raft/HybridTime implementation has its roots in Apache Kudu the results will NOT be quite applicable to Kudu. Aside from the fact that the code base has evolved/diverged over the 3+ years, there are key/relevant areas (ones very relevant to these Jepsen tests) where YugaByte DB has added capabilities or follows a different design than Kudu. For example:

-- Leader Leases: YugaByte DB doesn't use Raft consensus for reads. Instead, we have implemented "leader leases" to ensure safety in allowing reads to be served from a tablet's Raft leader.

-- Distributed/Multi-Shard Transactions: YugaByte DB uses a home grown (https://docs.yugabyte.com/latest/architecture/transactions/t...) protocol based on two-phase commit across multiple Raft groups. Capabilities like secondary indexes, multi-row updates use multi-shard transactions.

-- Allowing online/dynamic Raft membership changes so that tablets can be moved (such as for load-balancing to new nodes).

regards Kannan (Co-founder @ YugaByte)

FWIW, we implemented dynamic consensus membership change in Kudu way back in 2015 (https://github.com/apache/kudu/commit/535dae) but presumably that was after the fork. We still haven't implemented leader leases or distributed transactions in Kudu though due to prioritizing other features. It's very cool that you have implemented those consistency features.
hi @mpercy,

Thanks for correcting me on the dynamic consensus membership change. Looks like the basic support was indeed there, but several important enhancements were needed (for correctness and usability).

- To make the "online" piece of the membership change work correctly we added support for LEARNER (PRE VOTER) role (where the new member enters in a non-voting mode till it's caught up). https://github.com/YugaByte/yugabyte-db/commit/909d26e31ecd0....

- Load Balancing (which uses the membership changes) is automatic. (https://github.com/YugaByte/yugabyte-db/commit/e4667eb7ec0e6...)

- Remote bootstrap (due to membership changes) also has undergone substantial changes given that YugaByte DB uses a customize/extended version of RocksDB as the storage engine and does a tighter coupling of Raft with RocksDB storage engine. (https://github.com/YugaByte/yugabyte-db/blob/master/docs/ext...)

- Dynamic Leader Balancing is also new-- it causes leadership to be proactively altered in a running system to ensure each node is the leader for a similar number of tablets.

regards, Kannan

Interesting. Just last year we implemented improved re-replication (https://github.com/apache/kudu/commit/79a255) which sounds very similar to what you did with LEARNER roles, and we added manually-triggered rebalancing (https://github.com/apache/kudu/commit/ccdcf6 and https://kudu.apache.org/releases/1.8.0/docs/administration.h...).

I'm curious if you did anything to prevent automatic rebalancing from being triggered at a "bad time" or have throttled it in some way, or whether moving large amounts of data between servers at arbitrary times was not a concern.

I am also curious if you added some type of API using the LEARNER role to support a CDC-type of listener interface using consensus.

By the way, we also recently added support for rack/location awareness in a series of patches including https://github.com/apache/kudu/commit/ebb285

We should really start some threads on the dev lists to periodically share this type of information and merge things back and forth to avoid duplicating work where possible. I know the systems are pretty different at the catalog and storage layers but there are still many similarities.

Yes, it does. At the core, the raft implementation is still based on kudu's. But, these areas have been worked on actively so the implementations might has diverged a little.

May be worth looking through the individual issues to see what applies and what doesn't:

https://github.com/YugaByte/yugabyte-db/projects/11