Hacker News
by FAANG_dream 1951 days ago
Ohh for sure!

I tried implementing Raft, which is supposed to be the most understandable out of all available consensus algorithms, but wasn't able to make it work 100%.

I came close but in the end gave up on resolving some concurrency issues :/

3 comments

I'm not sure if you are aware of it, but check out MIT 6.824 on OpenCourseWare (pick 2020, not 2021).

In Lab 2 they provide you with a skeleton for Raft with RPC, etc. already in place, leaving only the protocol for you to implement. The lab is also split into three logical parts: leader election, append entries, and persistence. Highly recommend.
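
To give a feel for the leader election part, here is a minimal sketch of the vote-granting rule from the Raft paper. It is written in Python rather than the lab's language, and `RaftNode`, `handle_request_vote`, and the field names are hypothetical, not the lab's skeleton. It ignores RPC, timers, and locking entirely.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RaftNode:
    """Hypothetical per-node state; not the 6.824 lab skeleton."""
    current_term: int = 0
    voted_for: Optional[int] = None
    last_log_index: int = 0
    last_log_term: int = 0

    def handle_request_vote(self, term: int, candidate_id: int,
                            last_log_index: int,
                            last_log_term: int) -> Tuple[int, bool]:
        """Return (current_term, vote_granted) for an incoming RequestVote."""
        # Reject candidates that are in a stale term.
        if term < self.current_term:
            return self.current_term, False
        # A newer term means we adopt it and forget any earlier vote.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
        # The candidate's log must be at least as up to date as ours:
        # compare last log term first, then last log index.
        up_to_date = (last_log_term, last_log_index) >= (
            self.last_log_term, self.last_log_index)
        # At most one vote per term, but re-granting to the same candidate is fine.
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return self.current_term, True
        return self.current_term, False
```

The sequential logic is the easy bit. In a real implementation this handler races with election timers and AppendEntries handlers, which is where the concurrency issues mentioned upthread tend to come from.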

You are not alone: https://news.ycombinator.com/item?id=23123701 and https://youtu.be/QVvFVwyElLY?t=2502

As TFA points out, distributed consensus is punishingly hard. AWS relied on TLA+ to model-check the designs of DynamoDB and other systems https://lamport.azurewebsites.net/tla/formal-methods-amazon....

Interestingly, Kinesis [0] and SQS (?) avoid consensus for those same reasons.

[0] https://news.ycombinator.com/item?id=25239100

Chain Replication (and friends) are vastly simpler than Paxos (and friends) in many ways, but do have the same requirement for determinism. That's because chain replicated systems typically need to be confluent (see https://pathelland.substack.com/p/dont-get-stuck-in-the-con-...), which means that all the replicas need to have the same value in them when replication is done. Conceptually simpler, for sure, but many of the same challenges remain.
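
A toy sketch of that requirement, assuming a hypothetical in-memory `Replica` class with a tiny op format (`"set"` and `"incr"`). Failure handling and reconfiguration are left out, so this is nowhere near a real chain replication protocol. Every replica applies the same deterministic operation in the same order, so all stores match once the tail acknowledges.

```python
class Replica:
    """Hypothetical in-memory replica; forwards each write down the chain."""

    def __init__(self, successor=None):
        self.store = {}
        self.successor = successor

    def write(self, op):
        """Apply a deterministic op locally, then pass it to the successor."""
        kind, key, value = op
        if kind == "set":
            self.store[key] = value
        elif kind == "incr":
            self.store[key] = self.store.get(key, 0) + value
        else:
            raise ValueError(f"unknown op kind: {kind!r}")
        if self.successor is not None:
            return self.successor.write(op)
        return True  # the tail acknowledges the write


def build_chain(length):
    """Return replicas ordered head first, each linked to the next."""
    node = None
    chain = []
    for _ in range(length):
        node = Replica(successor=node)  # first one built is the tail
        chain.append(node)
    chain.reverse()
    return chain
```

If an op depended on something local to each replica (wall-clock time, a random number), the stores could diverge even though every replica "applied the same write", which is the determinism problem in a nutshell.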

How did you know there were still concurrency issues? I would expect issues like that to be relatively subtle.

Did you build out a test suite?

Heh... knowing for certain that you've gotten all the concurrency issues is hard. Knowing that a specific code base has concurrency issues can be very easy, because it clearly exhibits bugs. I haven't tried to implement Raft myself, but I've certainly had code bases that clearly had concurrency bugs in them, even if I didn't know exactly what they were. :)
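
One cheap way such bugs show themselves is an invariant check run over whatever a test records. As a sketch, assuming the test harness logs a hypothetical list of `(term, node_id)` pairs each time a node believes it became leader, the function below flags any term that ended up with more than one leader. The Raft paper calls this property Election Safety. The function name and event format are made up for illustration.

```python
from collections import defaultdict


def terms_with_multiple_leaders(leader_events):
    """Map each offending term to the set of nodes that claimed leadership in it.

    leader_events: iterable of (term, node_id) pairs (hypothetical log format).
    An empty result means no violation was observed in this run.
    """
    leaders_by_term = defaultdict(set)
    for term, node_id in leader_events:
        leaders_by_term[term].add(node_id)
    return {term: ids for term, ids in leaders_by_term.items() if len(ids) > 1}
```

A non-empty result tells you a bug exists without telling you where it is. An empty result only means this particular run did not trip the invariant.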