by ergl 3368 days ago
Jane Street uses the same approach to build their exchange [0]. Like the doc says, it can be great to replay some sequence of messages in dev to reproduce issues, and to give fault-tolerance to the system.

One downside is that, if all your nodes are using the same application code, simply replaying the log might not help as all nodes might hit exactly the same bug with the same sequence of transitions.
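To make that downside concrete, here is a minimal sketch of deterministic log replay. Everything in it is hypothetical and for illustration only: the `Node` class, the message tuples, and the deliberate bug are assumptions, not code from Jane Street or the article. Two nodes running the same code over the same log end up in the same state and fail on the same entry.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical replica: a deterministic state machine over account balances."""
    balances: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied successfully

    def apply(self, msg):
        kind, account, amount = msg
        if kind == "deposit":
            self.balances[account] = self.balances.get(account, 0) + amount
        elif kind == "withdraw":
            # Deliberate bug for illustration: unknown accounts are not handled.
            self.balances[account] -= amount
        else:
            raise ValueError(f"unknown message kind: {kind}")
        self.applied += 1


def replay(log):
    """Apply the log in order to a fresh node; return the node and any error hit."""
    node = Node()
    try:
        for msg in log:
            node.apply(msg)
    except Exception as exc:
        return node, exc
    return node, None


# Hypothetical message log shared by every node.
LOG = [
    ("deposit", "alice", 100),
    ("withdraw", "alice", 30),
    ("withdraw", "bob", 5),   # triggers the bug: bob has no balance entry
    ("deposit", "alice", 1),
]

prod, prod_err = replay(LOG)  # stands in for a production node
dev, dev_err = replay(LOG)    # stands in for a replay in dev
```

Replaying in dev reproduces the failure exactly, which is what makes debugging easy, but a standby node running the same code would stop at the same entry, so replication alone does not mask this kind of bug.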

[0] There's an overview of their infrastructure here: https://youtube.com/watch?v=b1e4t2k2KJY


Thanks for sharing the video, and great talk btw. Brian, the speaker, actually asks the audience (around minute 20 in the video) whether anybody uses Paxos for the matching engine. What I'm talking about in the article is exactly that, except we're using another consensus algorithm (Raft), which is significantly simpler to implement than Paxos.

LMAX use synchronous replication in their exchange: https://www.infoq.com/presentations/LMAX

What kind of latency does the consensus add? We are looking at adding fault tolerance to our matching engine but can only afford 10-15 micros.

Related to the latency question: I just watched the Jane Street video (very nice!), and he mentioned that they use operator-initiated failover and that he didn't know of anyone using a consensus-based approach because it adds an extra hop. Does your Raft-based solution do automatic failover?