Hacker News new | ask | show | jobs
by mavelikara 4060 days ago
I have a slightly off-topic question for Diego and other people experienced with distributed systems.

Why are consensus algorithms always developed as systems, not as libraries? Zookeeper, etcd and LogCabin all operate as a cluster of processes which other nodes connect to over a client library.

I can imagine that the distributed-state-machine-replication-mechanism of Raft or ZAB being implemented as a library where the user has to provide an implementation of the communication layer. Such a library can be used as a starting point for building other more complex systems which aim to provide a low-friction install experience. For example, one good thing about both Cassandra and ElasticSearch is that they both have homogeneous clusters where all nodes play the same role. Incidentally, from what I understand, they both embed a consensus implementation within.

Similarly, a membership service (and failure detector) over gossip protocols will also be very useful.

An installation guide which starts with "First, install Zookeeper cluster. Then, install a SWARM cluster. Then ..." is not very appealing. That being the case, I wonder why there is no mature OSS library which provides these services. What does HN think about this situation?

7 comments

This question came up at the CoreOS Fest earlier today too. I think everyone wants libraries as well as services, and I'd like that too for LogCabin one day. It's just a bit harder when you start thinking about the details. It's not impossible though. As one example, CockroachDB is using etcd's Raft implementation as a library.

Part of the issue is that we want to have libraries with small simple interfaces, and a consensus library is probably going to be on the large side, as it has to interface with the disk, the network, and the state machine, plus membership changes, log compaction, who's the leader, configuration settings, debug logging, etc.

Another issue is that, as a library, it has to be in the language you're using. And if that language is C++ or similar, it has to be compatible/convenient with whatever threading approach you're using.

Then there's performance/memory. Some Raft implementations are designed to keep relatively small amounts of data in memory (like LogCabin or etcd), and others are meant to store large amounts on disk (like HydraBase). Some are optimized more for performance, others for safety.

I think we'll get libraries eventually. Keep in mind Raft is still very young. Paxos is around 26 years old, Raft is only .5 to 3 years old (depending on when you start counting). I like to think that Raft lowered the cost of developing a consensus system/library significantly, but it still takes time to develop mature implementations. Right now we have a lot of implementations in early stages; I wonder if some of these efforts will consolidate over time into really flexible libraries.

Thanks Deigo! From yours and other siblings comments what I gather is that the idea of a library is not fundamentally flawed, it is just that the state of the art has not matured to that level.
In my experience, consensus algorithms are developed as algorithms first, (reference) systems second, and libraries third. The reasons are simply that the algorithm is the most important part, and good libraries are hard. Diego has rightly focused on the algorithm, which has reached maturity before this LogCabin release.

I work on CockroachDB, which uses etcd's Raft implementation, so we are an existence proof that consensus libraries are possible. However, the library we share was the second attempt by the authors (Xiang Li and Yicheng Qin) to create a Raft implementation. The first attempt to develop a "library form" of a consensus algorithm is likely to reach a dead end, but over time it is possible to develop a library-style implementation.

A "library" that wants to have a process of some sort continuously running encounters the problem that the more cross-language generic you try to make the library, the less possible it is to work out a threading model that can be used by everyone, and the less language-generic you make it, the smaller your target audience is, which has non-linear affects on the usage and development resources it can attract. Languages that include a "blessed" runtime model like Node or Go can easily ship such "daemonized libraries", but then they are generally impossible to bind to for anybody else. Using POSIX threads, by contrast, will lock a lot of other languages out that have very sophisticated runtimes, be very difficult to use in others, and, even in C, when you get right down to it, there's no guarantee that they'll play nicely with the rest of the C program.

Distributing a service invokes the sort of lowest-common-denominator solution to this problem, which is "OS process". Everyone can talk to an OS process.

(This is an explanation, not advocacy or celebration.)

Well etcd started out using the goraft library, so this isn't always true, but I can think of some reasons I'd prefer to have the quorum as a separate system:

Writing it as a service lets you implement it in once in a high-level language like Java or Go and use it from all languages. To get the same portability in a library you'd have to write it in C. It's hard enough to maintain consistent data without worrying about memory corruption bugs.

The protocol implementation is only a part of durable consensus, you also need durable storage, sensible network timeouts, shutdown handling, etc.

You usually want different configurations for your application and your quorum. For the quorum you usually want 3, 5 or 7 members, while your application could be anywhere from one to hundreds of instances. Your quorum members must always know how many other members are supposed to be in the quorum, while your application could remove or add instances on the fly. For the quorum you need low latency, while you may want to optimise your application for throughput. (e.g. disks, garbage collection, swapping)

Service Fabric does this. You provide the stream of operations and Fabric replicates them using distributed consensus.

Databases can be hosted on the framework. DocumentDB is, for example. While at MS, I wrote a near-real-time metrics system using Fabric and worked on a tunable-consistency distributed cache built on Fabric.

Summary of Service Fabric here: http://daprlabs.com/blog/blog/2015/04/30/service-fabric-2/

The Rust Raft library[0] (not my project) is a library; you must specify how the store[1]/statemachine[2] works under the hood, though it comes with default implementations for both. I'd like for it to abstract over communication channels too, but currently it's tightly wired to an RPC and IO library and I don't have the time to try and fix that.

[0]: https://hoverbear.github.io/raft/raft/index.html

[1]: https://hoverbear.github.io/raft/raft/store/trait.Store.html

[2]: https://hoverbear.github.io/raft/raft/state_machine/trait.St...

There are at least two Raft libraries for Go, so they do exist.